[
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354659#comment-17354659
]
Haoyu Zhai edited comment on LUCENE-9983 at 5/31/21, 9:07 PM:
--------------------------------------------------------------
Thanks [~mikemccand] and [~dweiss]. I've opened a PR based on
{{IntIntHashMap}}: [https://github.com/apache/lucene/pull/163]
I've applied the test attached in LUCENE-9981 to verify this PR helps. Seems it
successfully reduce the time it need before throwing the exception from 6 min
to 16 sec on my local machine (they both stoped at the same point as well).
I still kept the state array to be sorted when get it, so we'll be slower when
actually getting array but way faster on putting/removing keys. I'm not quite
sure why the speed up is this much, but my guess is we're doing way more
operations and spending way more times on increasing/decreasing state count and
putting/removing states from the set than introducing new states?
was (Author: zhai7631):
Thanks [~mikemccand] and [~dweiss]. I've opened a PR based on
{{IntIntHashMap}}: [https://github.com/apache/lucene/pull/162]
I've applied the test attached in LUCENE-9981 to verify this PR helps. Seems it
successfully reduce the time it need before throwing the exception from 6 min
to 16 sec on my local machine (they both stoped at the same point as well).
I still kept the state array to be sorted when get it, so we'll be slower when
actually getting array but way faster on putting/removing keys. I'm not quite
sure why the speed up is this much, but my guess is we're doing way more
operations and spending way more times on increasing/decreasing state count and
putting/removing states from the set than introducing new states?
> Stop sorting determinize powersets unnecessarily
> ------------------------------------------------
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all
> subsets of NFA states that "belong" in the same determinized state, using
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically
> freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep
> these growing maps of int key, int value sorted by key, e.g. upgrading to a
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here! Really all we need is the
> ability to add/delete keys from the map, and hashCode / equals (by key only –
> ignoring value!), and to freeze the map (a small optimization that we could
> skip initially). We only use these maps to lookup in the (growing)
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from
> [HPPC|https://github.com/carrotsearch/hppc]? And then change its
> {{hashCode}}/{{equals }}to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial)
> regexps we saw on LUCENE-9981.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]