[jira] [Commented] (LUCENE-5815) Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery

ASF subversion and git services (JIRA) Sun, 20 Jul 2014 04:44:15 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067893#comment-14067893 ]


ASF subversion and git services commented on LUCENE-5815:
---------------------------------------------------------

Commit 1612077 from [~mikemccand] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1612077 ]

LUCENE-5815: add TermAutomatonQuery

> Add TermAutomatonQuery, for proximity matching that generalizes 
> MultiPhraseQuery/SpanNearQuery
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5815
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5815
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5815.patch, LUCENE-5815.patch
>
>
> I created a new query, called TermAutomatonQuery, that's a proximity
> query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
> construct an arbitrary automaton whose transitions are whole terms, and
> then find all documents that the automaton matches.  This is different
> from a "normal" automaton, whose transitions are usually
> bytes/characters within a term.
> So, if the automaton has just 1 transition, it's just an expensive
> TermQuery.  If you have two transitions in sequence, it's a phrase
> query of two terms.  You can express synonyms by using transitions
> that overlap one another, but the automaton doesn't have to be a
> "sausage" (as MultiPhraseQuery requires), i.e. it "respects" posLength
> (at query time).
> It also allows "any" transitions, to match any term, so you can do
> sloppy matching and span-like queries, e.g. find "lucene" and "python"
> with up to 3 other terms in between.
> I also added a class to convert a TokenStream directly to the
> automaton for this query, preserving posLength.  (Of course, the index
> can't store posLength, so the matching won't be fully correct if any
> indexed token has posLength != 1.)  But if you do query-time-only
> synonyms, then the matching should finally be correct.
> I haven't tested performance, but I suspect it's fairly slow: its
> cost is O(sum of totalTF) over all terms "used" in the automaton.  There
> are some optimizations we could do, e.g. detecting that some terms in
> the automaton can be upgraded to MUST (right now they are all
> effectively SHOULD).
> I'm not sure how it should assign scores (punted on that for now), but
> the matching seems to be working.
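
To make the "any" transition idea from the description concrete, here is a minimal sketch in Python of an automaton whose transitions are whole terms. It is a toy model, not Lucene's implementation: the names TermAutomaton, ANY, and lucene_near_python are hypothetical, and it matches an in-memory token list rather than postings. It builds the example from the description, "lucene" followed by "python" with up to 3 other terms in between.

```python
from collections import defaultdict

ANY = object()  # hypothetical marker for a transition that matches any term


class TermAutomaton:
    """Toy NFA whose transitions are whole terms (illustration only, not Lucene's class)."""

    def __init__(self):
        self.num_states = 0
        self.transitions = defaultdict(list)  # state -> [(label, dest), ...]
        self.accept = set()

    def create_state(self, accept=False):
        # The first state created (state 0) is treated as the initial state.
        state = self.num_states
        self.num_states += 1
        if accept:
            self.accept.add(state)
        return state

    def add_transition(self, source, dest, label):
        self.transitions[source].append((label, dest))

    def matches(self, tokens):
        """Return True if some contiguous run of tokens drives the automaton to an accept state."""
        if 0 in self.accept:
            return True
        active = set()
        for token in tokens:
            active.add(0)  # a match may begin at any position in the document
            next_active = set()
            for state in active:
                for label, dest in self.transitions[state]:
                    if label is ANY or label == token:
                        next_active.add(dest)
            if next_active & self.accept:
                return True
            active = next_active
        return False


def lucene_near_python(max_gap=3):
    """Build: "lucene", then 0..max_gap arbitrary terms, then "python"."""
    a = TermAutomaton()
    start = a.create_state()
    prev = a.create_state()
    a.add_transition(start, prev, "lucene")
    end = a.create_state(accept=True)
    a.add_transition(prev, end, "python")  # zero terms in between
    for _ in range(max_gap):
        nxt = a.create_state()
        a.add_transition(prev, nxt, ANY)       # consume one arbitrary term
        a.add_transition(nxt, end, "python")   # then allow "python"
        prev = nxt
    return a


query = lucene_near_python()
print(query.matches("we love lucene and its python bindings".split()))  # True (2 terms between)
print(query.matches("lucene is a search library not python".split()))   # False (5 terms between)
```

With a single transition this toy automaton behaves like a term lookup, and with two transitions in sequence like a two-term phrase match, mirroring the special cases listed in the description.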


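The point about preserving posLength when converting a token stream can be illustrated with another small Python sketch. It is an assumption-laden toy, not the converter class mentioned in the description: the function names and the (term, position increment, position length) tuple layout are hypothetical, and the "sausage" variant simply ignores posLength to show what is lost. The example expands "dns server" with a query-time synonym "domain name system", where "dns" spans three positions.

```python
def token_graph_to_transitions(tokens, respect_pos_length=True):
    """Turn (term, pos_inc, pos_len) tuples into (source, dest, term) transitions.

    States are token positions; returns the transitions and the accept state.
    """
    transitions = []
    pos = -1
    last = 0
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        length = pos_len if respect_pos_length else 1  # a "sausage" forces length 1
        transitions.append((pos, pos + length, term))
        last = max(last, pos + length)
    return transitions, last


def matches(transitions, accept, doc_terms):
    """Return True if some contiguous run of doc_terms reaches the accept state."""
    active = set()
    for term in doc_terms:
        active.add(0)  # a match may begin at any position
        active = {dst for src, dst, label in transitions
                  if src in active and label == term}
        if accept in active:
            return True
    return False


# Query-time token graph: "dns" is stacked on "domain" and spans 3 positions.
query_tokens = [
    ("domain", 1, 1),
    ("dns", 0, 3),
    ("name", 1, 1),
    ("system", 1, 1),
    ("server", 1, 1),
]

graph, graph_accept = token_graph_to_transitions(query_tokens)
sausage, sausage_accept = token_graph_to_transitions(query_tokens, respect_pos_length=False)

print(matches(graph, graph_accept, "dns server".split()))                  # True
print(matches(graph, graph_accept, "dns name system server".split()))      # False
print(matches(sausage, sausage_accept, "dns server".split()))              # False
print(matches(sausage, sausage_accept, "dns name system server".split()))  # True
```

In this toy model, honoring posLength lets "dns" jump directly to the state just before "server", while the flattened version both misses "dns server" and accepts the nonsensical "dns name system server".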

--
This message was sent by Atlassian JIRA
(v6.2#6252)

