Hi Mike,

Looking at the algorithm it's the opposite - it tries hard not to iterate
over all the combinations. The "power sets" come from combinations of
accessible sets of states using different NFA paths. This example is pretty
good:

https://en.wikipedia.org/wiki/Powerset_construction#Example

Dawid

On Tue, Jun 1, 2021 at 1:11 AM Michael Sokolov <[email protected]> wrote:

> I haven't looked at this at all, so please disregard if I'm off base, but
> it sounds as if we are materializing power sets? Would it be better to just
> store an array of the members and generate the combinations iteratively on
> demand?
>
> On Mon, May 31, 2021, 5:27 PM Haoyu Zhai (Jira) <[email protected]> wrote:
>
>>
>>     [
>> https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354660#comment-17354660
>> ]
>>
>> Haoyu Zhai commented on LUCENE-9983:
>> ------------------------------------
>>
>> I've added a simple static counter just for the adversarial test, and
>> here's the stats:
>>  * {{incr}} called: 106073079
>>  * entry added to set: 100076079
>>  * {{decr}} called: 106069079
>>  * entry removed from set: 100072079
>>  * {{computeHash}} called: 40057
>>  * {{freeze}} called: 14056
>>
>> So seems to me my guess above holds, we're doing way more put/remove
>> entry operations than others
>>
>> > Stop sorting determinize powersets unnecessarily
>> > ------------------------------------------------
>> >
>> >                 Key: LUCENE-9983
>> >                 URL: https://issues.apache.org/jira/browse/LUCENE-9983
>> >             Project: Lucene - Core
>> >          Issue Type: Improvement
>> >            Reporter: Michael McCandless
>> >            Priority: Major
>> >          Time Spent: 40m
>> >  Remaining Estimate: 0h
>> >
>> > Spinoff from LUCENE-9981.
>> > Today, our {{Operations.determinize}} implementation builds powersets
>> of all subsets of NFA states that "belong" in the same determinized state,
>> using [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction
>> ].
>> > To hold each powerset, we use a malleable {{SortedIntSet}} and
>> periodically freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high
>> price to keep these growing maps of int key, int value sorted by key, e.g.
>> upgrading to a {{TreeMap}} once the map is large enough (> 30 entries).
>> > But I think sorting is entirely unnecessary here!  Really all we need
>> is the ability to add/delete keys from the map, and hashCode / equals (by
>> key only – ignoring value!), and to freeze the map (a small optimization
>> that we could skip initially).  We only use these maps to lookup in the
>> (growing) determinized automaton whether this powerset has already been
>> seen.
>> > Maybe we could simply poach the {{IntIntScatterMap}} implementation
>> from [HPPC|https://github.com/carrotsearch/hppc]?  And then change its
>> {{hashCode}}/{{equals }}to only use keys (not values).
>> > This change should be a big speedup for the kinds of (admittedly
>> adversarial) regexps we saw on LUCENE-9981.
>> >
>>
>>
>>
>> --
>> This message was sent by Atlassian Jira
>> (v8.3.4#803005)
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>

Reply via email to