[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358418#comment-17358418
 ] 

Bruno Roustant commented on LUCENE-9983:
----------------------------------------

[~zhai7631] do you have some stats about the numbers of NFA / DFA states 
manipulated?

> Stop sorting determinize powersets unnecessarily
> ------------------------------------------------
>
>                 Key: LUCENE-9983
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9983
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to