[
https://issues.apache.org/jira/browse/LUCENE-5752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034513#comment-14034513
]
Michael McCandless commented on LUCENE-5752:
--------------------------------------------
Thanks Rob.
bq. concatenate: as mentioned before, we rely on this today in quite a few
places, and now the runtime has significantly changed (when the left side is a
singleton)
Well, in RegExp we follow up that concatenate with a minimize. In
WildcardQuery the incoming automata are small anyway ... and I fixed
LevA to insert the prefix itself, which avoids the full copy of the
fuzzy suffix part.
bq. singleton: speaking of such, this optimization is removed, but are we sure
about this? In practice this is probably extremely effective, maybe even
outweighing any other optimizations we could do.
I really didn't like this duality / mutability (how you sometimes had
to call expandSingleton for ops that cared), and I don't see where this
optimization would really make a difference in Lucene. We already have
DaciukMihov to efficiently build a minimal union automaton ...
I agree that for a general purpose automaton library this might make
sense ... but I don't think it really helps Lucene.
bq. regex/wildcard parsing: we should really test that this isn't totally crazy
(read: blowing up) now.
I was worried about this too but when I looked at RegExp it calls
minimize after all of these ops! So I think the added cost of the
copy is likely in the noise ...
bq. acceptStates: should this really be a hashset? is there a reason not to use
a bitset?
Hmm, it could be a bitset. I thought that typically the number of
accept states is small, but I agree that in the case when it's large
it'd be nice to not use way too much RAM ... I'll change it to bitset.
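
To make the hashset vs. bitset trade-off concrete, here is a minimal sketch using only java.util classes. The class and method names (AcceptStatesSketch, toBitSet, toHashSet) are made up for illustration and are not taken from the patch. A BitSet costs roughly one bit per state id up to the highest accept state, while a HashSet<Integer> pays for a boxed Integer and a hash entry per accept state.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not code from the LUCENE-5752 patch.
class AcceptStatesSketch {

    // Mark each int state id as accepting in a BitSet (one bit per state id).
    static BitSet toBitSet(int[] acceptStates) {
        BitSet bits = new BitSet();
        for (int state : acceptStates) {
            bits.set(state);
        }
        return bits;
    }

    // Same information in a HashSet: each state id is boxed and hashed.
    static Set<Integer> toHashSet(int[] acceptStates) {
        Set<Integer> set = new HashSet<>();
        for (int state : acceptStates) {
            set.add(state);
        }
        return set;
    }

    public static void main(String[] args) {
        int[] accept = {3, 7, 42};
        BitSet bits = toBitSet(accept);
        Set<Integer> set = toHashSet(accept);

        // Membership checks answer the same question in both representations.
        System.out.println("bitset says 7 accepts:  " + bits.get(7));
        System.out.println("hashset says 7 accepts: " + set.contains(7));

        // Iterating accept states in increasing order is direct with a BitSet.
        for (int s = bits.nextSetBit(0); s >= 0; s = bits.nextSetBit(s + 1)) {
            System.out.println("accept state " + s);
        }
    }
}
```

With int state ids, the bitset also gives ordered iteration for free, which a plain HashSet does not.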
> Explore light weight Automaton replacement
> ------------------------------------------
>
> Key: LUCENE-5752
> URL: https://issues.apache.org/jira/browse/LUCENE-5752
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 5.0
>
> Attachments: LUCENE-5752.patch
>
>
> This effort started with the patch on LUCENE-4556, to create a "light
> weight" replacement for the current object-heavy Automaton class
> (which creates separate State and Transition objects).
> I took that initial patch much further, and cut over most places in
> Lucene that use Automaton to LightAutomaton. Tests pass.
> The core idea of LightAutomaton is that all states are ints, and you
> build up the automaton under the restriction that you add all outgoing
> transitions one state at a time. This worked well for most
> operations, but for some (e.g. UTF32ToUTF8!!) it was harder, so I also
> added a separate builder that accepts transitions in any order; at
> the end they are sorted and added to the real automaton.
> If this is successful I think we should just replace the current
> Automaton with LightAutomaton; right now they both exist in my current
> patch...
> This is very much a work in progress, and I'm not sure the
> restrictions the API imposes are "reasonable" (some algos got uglier).
> But I think it's at least worth exploring/iterating... I'll make a branch and
> commit my current state.
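
The description above (int states, all outgoing transitions added one state at a time, plus a builder that sorts out-of-order transitions) can be sketched roughly as follows. This is only an illustration under my own assumptions: IntAutomatonSketch, its Builder, and the flat int[] layout are hypothetical and are not the LightAutomaton API from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of an int-state automaton; not the actual LightAutomaton.
class IntAutomatonSketch {
    private int[] starts = new int[4];  // offset into trans of each state's first transition, -1 if none
    private int[] counts = new int[4];  // number of transitions per state
    private int[] trans = new int[12];  // flat triples: dest, min label, max label
    private int numStates, numTrans;
    private int curState = -1;          // state currently receiving transitions
    private final BitSet accept = new BitSet();

    int createState() {
        if (numStates == starts.length) {
            starts = Arrays.copyOf(starts, 2 * numStates);
            counts = Arrays.copyOf(counts, 2 * numStates);
        }
        starts[numStates] = -1;
        return numStates++;
    }

    void setAccept(int state) { accept.set(state); }

    // Restriction: all transitions leaving one state must be added contiguously.
    void addTransition(int source, int dest, int min, int max) {
        if (source != curState) {
            if (starts[source] != -1) {
                throw new IllegalStateException("state " + source + " was already finished");
            }
            starts[source] = 3 * numTrans;
            curState = source;
        }
        if (3 * numTrans == trans.length) {
            trans = Arrays.copyOf(trans, 2 * trans.length);
        }
        int o = 3 * numTrans++;
        trans[o] = dest; trans[o + 1] = min; trans[o + 2] = max;
        counts[source]++;
    }

    // Follow the first transition whose [min, max] range contains label; -1 if none.
    int step(int state, int label) {
        for (int i = 0; i < counts[state]; i++) {
            int o = starts[state] + 3 * i;
            if (trans[o + 1] <= label && label <= trans[o + 2]) return trans[o];
        }
        return -1;
    }

    // Run from state 0 over the code points of s (assumes state 0 exists and is initial).
    boolean run(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            state = step(state, cp);
            if (state == -1) return false;
            i += Character.charCount(cp);
        }
        return accept.get(state);
    }

    // Accepts transitions in any order, then sorts by source state and replays them.
    static class Builder {
        private final List<int[]> pending = new ArrayList<>();
        private final BitSet accept = new BitSet();
        private int numStates;

        int createState() { return numStates++; }
        void setAccept(int state) { accept.set(state); }
        void addTransition(int source, int dest, int min, int max) {
            pending.add(new int[] {source, dest, min, max});
        }

        IntAutomatonSketch finish() {
            pending.sort((a, b) -> Integer.compare(a[0], b[0]));
            IntAutomatonSketch a = new IntAutomatonSketch();
            for (int i = 0; i < numStates; i++) a.createState();
            for (int s = accept.nextSetBit(0); s >= 0; s = accept.nextSetBit(s + 1)) a.setAccept(s);
            for (int[] t : pending) a.addTransition(t[0], t[1], t[2], t[3]);
            return a;
        }
    }

    public static void main(String[] args) {
        Builder b = new Builder();
        int s0 = b.createState(), s1 = b.createState(), s2 = b.createState();
        b.setAccept(s2);
        b.addTransition(s0, s1, 'a', 'a');  // transitions for s0 and s1 are interleaved,
        b.addTransition(s1, s2, 'b', 'b');  // which the plain automaton would reject
        b.addTransition(s0, s2, 'c', 'c');
        IntAutomatonSketch a = b.finish();
        System.out.println(a.run("ab") + " " + a.run("c") + " " + a.run("a"));
    }
}
```

Keeping each state's transitions contiguous is what lets the whole automaton live in a few int arrays instead of separate State and Transition objects; the builder trades a sort at the end for the freedom to add transitions in any order.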