Hi Mike, Looking at the algorithm it's the opposite - it tries hard not to iterate over all the combinations. The "power sets" come from combinations of accessible sets of states using different NFA paths. This example is pretty good:
https://en.wikipedia.org/wiki/Powerset_construction#Example Dawid On Tue, Jun 1, 2021 at 1:11 AM Michael Sokolov <[email protected]> wrote: > I haven't looked at this at all, so please disregard if I'm off base, but > it sounds as if we are materializing power sets? Would it be better to just > store an array of the members and generate the combinations iteratively on > demand? > > On Mon, May 31, 2021, 5:27 PM Haoyu Zhai (Jira) <[email protected]> wrote: > >> >> [ >> https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354660#comment-17354660 >> ] >> >> Haoyu Zhai commented on LUCENE-9983: >> ------------------------------------ >> >> I've added a simple static counter just for the adversarial test, and >> here's the stats: >> * {{incr}} called: 106073079 >> * entry added to set: 100076079 >> * {{decr}} called: 106069079 >> * entry removed from set: 100072079 >> * {{computeHash}} called: 40057 >> * {{freeze}} called: 14056 >> >> So seems to me my guess above holds, we're doing way more put/remove >> entry operations than others >> >> > Stop sorting determinize powersets unnecessarily >> > ------------------------------------------------ >> > >> > Key: LUCENE-9983 >> > URL: https://issues.apache.org/jira/browse/LUCENE-9983 >> > Project: Lucene - Core >> > Issue Type: Improvement >> > Reporter: Michael McCandless >> > Priority: Major >> > Time Spent: 40m >> > Remaining Estimate: 0h >> > >> > Spinoff from LUCENE-9981. >> > Today, our {{Operations.determinize}} implementation builds powersets >> of all subsets of NFA states that "belong" in the same determinized state, >> using [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction >> ]. >> > To hold each powerset, we use a malleable {{SortedIntSet}} and >> periodically freeze it to a {{FrozenIntSet}}, also sorted. We pay a high >> price to keep these growing maps of int key, int value sorted by key, e.g. >> upgrading to a {{TreeMap}} once the map is large enough (> 30 entries). >> > But I think sorting is entirely unnecessary here! Really all we need >> is the ability to add/delete keys from the map, and hashCode / equals (by >> key only – ignoring value!), and to freeze the map (a small optimization >> that we could skip initially). We only use these maps to lookup in the >> (growing) determinized automaton whether this powerset has already been >> seen. >> > Maybe we could simply poach the {{IntIntScatterMap}} implementation >> from [HPPC|https://github.com/carrotsearch/hppc]? And then change its >> {{hashCode}}/{{equals }}to only use keys (not values). >> > This change should be a big speedup for the kinds of (admittedly >> adversarial) regexps we saw on LUCENE-9981. >> > >> >> >> >> -- >> This message was sent by Atlassian Jira >> (v8.3.4#803005) >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> >>
