[ https://issues.apache.org/jira/browse/LUCENE-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067893#comment-14067893 ]
ASF subversion and git services commented on LUCENE-5815:
---------------------------------------------------------
Commit 1612077 from [~mikemccand] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1612077 ]
LUCENE-5815: add TermAutomatonQuery
> Add TermAutomatonQuery, for proximity matching that generalizes
> MultiPhraseQuery/SpanNearQuery
> ----------------------------------------------------------------------------------------------
>
> Key: LUCENE-5815
> URL: https://issues.apache.org/jira/browse/LUCENE-5815
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 5.0, 4.10
>
> Attachments: LUCENE-5815.patch, LUCENE-5815.patch
>
>
> I created a new query, called TermAutomatonQuery, that's a proximity
> query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
> construct an arbitrary automaton whose transitions are whole terms, and
> then find all documents that the automaton matches. This is different
> from a "normal" automaton, whose transitions are usually
> bytes/characters within a term.
> So, if the automaton has just 1 transition, it's just an expensive
> TermQuery. If you have two transitions in sequence, it's a phrase
> query of two terms. You can express synonyms by using transitions
> that overlap one another, but the automaton doesn't have to be a
> "sausage" (as MultiPhraseQuery requires), i.e. it "respects" posLength
> (at query time).
> It also allows "any" transitions, to match any term, so you can do
> sloppy matching and span-like queries, e.g. find "lucene" and "python"
> with up to 3 other terms in between.
> I also added a class to convert a TokenStream directly to the
> automaton for this query, preserving posLength. (Of course, the index
> can't store posLength, so the matching won't be fully correct if any
> indexed token has posLength != 1.) But if you use query-time-only
> synonyms, then the matching should finally be correct.
> I haven't tested performance, but I suspect it's quite slow: its
> cost is O(sum of totalTF) over all terms "used" in the automaton. There
> are some optimizations we could do, e.g. detecting that some terms in
> the automaton can be upgraded to MUST (right now they are all
> effectively SHOULD).
> I'm not sure how it should assign scores (punted on that for now), but
> the matching seems to be working.
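
To make the description above concrete, here is a minimal, self-contained sketch of an automaton whose transitions are whole terms, including "any" transitions and a non-sausage synonym. It is a toy in-memory model, not the code from the patch: the class TermAutomatonSketch, its method names, and the two example automata are assumptions made for illustration, and a "document" is just a list of tokens that each occupy one position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a term-level automaton (hypothetical; not Lucene's API). */
final class TermAutomatonSketch {
    /** One transition; a null term means "any term". */
    private static final class Transition {
        final int source, dest;
        final String term;
        Transition(int source, int dest, String term) {
            this.source = source; this.dest = dest; this.term = term;
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private final Set<Integer> acceptStates = new HashSet<>();
    private int stateCount = 0;

    /** The first state created (0) is treated as the initial state in this sketch. */
    int createState() { return stateCount++; }
    void setAccept(int state) { acceptStates.add(state); }
    void addTransition(int source, int dest, String term) {
        transitions.add(new Transition(source, dest, term));
    }
    void addAnyTransition(int source, int dest) {
        transitions.add(new Transition(source, dest, null));
    }

    /** True if some contiguous run of tokens leads from state 0 to an accept state. */
    boolean matches(List<String> docTokens) {
        for (int start = 0; start < docTokens.size(); start++) {
            Set<Integer> current = new HashSet<>();
            current.add(0);
            for (int pos = start; pos < docTokens.size() && !current.isEmpty(); pos++) {
                Set<Integer> next = new HashSet<>();
                for (Transition t : transitions) {
                    boolean termOk = t.term == null || t.term.equals(docTokens.get(pos));
                    if (termOk && current.contains(t.source)) next.add(t.dest);
                }
                for (int s : next) if (acceptStates.contains(s)) return true;
                current = next;
            }
        }
        return false;
    }

    /** "lucene" followed by "python" with up to 3 other terms in between. */
    static TermAutomatonSketch luceneNearPython() {
        TermAutomatonSketch a = new TermAutomatonSketch();
        int init = a.createState();
        int prev = a.createState();
        int accept = a.createState();
        a.setAccept(accept);
        a.addTransition(init, prev, "lucene");
        a.addTransition(prev, accept, "python");      // zero terms in between
        for (int gap = 0; gap < 3; gap++) {           // one, two or three "any" terms
            int skipped = a.createState();
            a.addAnyTransition(prev, skipped);
            a.addTransition(skipped, accept, "python");
            prev = skipped;
        }
        return a;
    }

    /** "wifi network" or "wi fi network": "wifi" spans two positions, so this is not a sausage. */
    static TermAutomatonSketch wifiNetwork() {
        TermAutomatonSketch a = new TermAutomatonSketch();
        int init = a.createState(), mid = a.createState();
        int afterWifi = a.createState(), accept = a.createState();
        a.setAccept(accept);
        a.addTransition(init, mid, "wi");
        a.addTransition(mid, afterWifi, "fi");
        a.addTransition(init, afterWifi, "wifi");     // single term covering both positions
        a.addTransition(afterWifi, accept, "network");
        return a;
    }

    public static void main(String[] args) {
        System.out.println(luceneNearPython().matches(Arrays.asList("lucene", "and", "also", "python")));
        System.out.println(wifiNetwork().matches(Arrays.asList("the", "wi", "fi", "network")));
    }
}
```

The sketch only answers whether a document matches; it does no scoring, and it scans every token of the document rather than walking postings, so it says nothing about the real query's cost.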