[
https://issues.apache.org/jira/browse/LUCENE-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266442#comment-13266442
]
Dawid Weiss commented on LUCENE-3830:
-------------------------------------
bq. Hmm I don't quite understand ... can you describe more?
I'm on vacation this week, so short: from the brief code inspection I concluded
you're using a search from each position to get the maximum match, is this
right? If so, the pessimistic time is quite bad for patterns like "aaa*b" and
input strings "aaaa*c" (replace * with a large number of repeated sequences).
The way I would try to implement it is to do a state-tracking much like in
non-deterministic automata, where you have a stack of "active" states which you
follow in the automaton and you move them from state to state on each input
symbol. If there a given state fires as a match then you have some conditions
to check -- are there any states that may be potentially longer but haven't
fired yet (if so, you need to delay this firing state because it can be
subsumed by others), otherwise when the no other active state blocks that one
you can do the replacement.
I haven't really thought about how twisted the logic above can become but I
suspect it's not going to be really that bad. The gain is that each input
symbol advances your state (at most linearly with the number of active states).
It helps with degenerate cases like the one above. I suppose what Mihov et al.
do is to statically (on the fst) determine which states can lead to this
"deferred match queue" and simply eliminate them, but haven't really looked
into it.
It's an improvement, doesn't need to be implemented right away. Sorry for being
brief -- the network is flaky here, I'm in the middle of nowhere.
> MappingCharFilter could be improved by switching to an FST.
> -----------------------------------------------------------
>
> Key: LUCENE-3830
> URL: https://issues.apache.org/jira/browse/LUCENE-3830
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Assignee: Michael McCandless
> Priority: Minor
> Labels: gsoc2012, lucene-gsoc-12
> Fix For: 4.0
>
> Attachments: LUCENE-3830.patch
>
>
> MappingCharFilter stores an overly complex tree-like structure for matching
> input patterns. The input is a union of fixed strings mapped to a set of
> fixed strings; an fst matcher would be ideal here and provide both memory and
> speed improvement I bet.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]