[jira] [Commented] (LUCENE-5815) Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery

Robert Muir (JIRA) Fri, 11 Jul 2014 06:27:25 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058754#comment-14058754
 ]


Robert Muir commented on LUCENE-5815:
-------------------------------------

{quote}
We can only rewrite to BQ of PQ if the automaton doesn't use the ANY token 
transition, and if it's finite, right? Or maybe we could somehow take ANY and 
map it to slop on the phrase queries?
{quote}

Hmm, OK, I see what you are getting at.

I guess I was only considering the case where the automaton actually comes 
from the query analysis chain: it would always be finite, and so on, in that 
case... right?
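
To make the finite case concrete, here is a rough sketch of what "rewrite to 
a BQ of PQs" could mean: walk every path of an acyclic term automaton and 
collect each path as one phrase. This is plain Java, not Lucene code; the 
class PhraseEnumerator, its Transition type, and the integer state numbering 
are all made up for illustration, and it assumes no cycles and no ANY 
transitions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy term-level automaton walker (hypothetical; not part of Lucene). */
final class PhraseEnumerator {

    /** A transition that consumes one whole term and moves to state dest. */
    static final class Transition {
        final String term;
        final int dest;

        Transition(String term, int dest) {
            this.term = term;
            this.dest = dest;
        }
    }

    /**
     * Lists every term sequence accepted by the automaton, starting at state 0.
     * Assumes the automaton is acyclic; a cycle would recurse forever.
     */
    static List<List<String>> enumerate(Map<Integer, List<Transition>> transitions,
                                        Set<Integer> accepts) {
        List<List<String>> out = new ArrayList<>();
        walk(0, transitions, accepts, new ArrayList<>(), out);
        return out;
    }

    private static void walk(int state,
                             Map<Integer, List<Transition>> transitions,
                             Set<Integer> accepts,
                             List<String> prefix,
                             List<List<String>> out) {
        if (accepts.contains(state)) {
            out.add(new ArrayList<>(prefix)); // one complete phrase
        }
        for (Transition t : transitions.getOrDefault(state, List.of())) {
            prefix.add(t.term);
            walk(t.dest, transitions, accepts, prefix, out);
            prefix.remove(prefix.size() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        // "wifi network" where "wifi" has the two-token synonym "wi fi":
        // 0 -wifi-> 2, 0 -wi-> 1, 1 -fi-> 2, 2 -network-> 3 (accept)
        Map<Integer, List<Transition>> transitions = Map.of(
            0, List.of(new Transition("wifi", 2), new Transition("wi", 1)),
            1, List.of(new Transition("fi", 2)),
            2, List.of(new Transition("network", 3)));
        // prints [[wifi, network], [wi, fi, network]]
        System.out.println(enumerate(transitions, Set.of(3)));
    }
}
```

Each inner list would then become one phrase clause of the disjunction. The 
number of paths can grow exponentially with the number of synonym branches, 
so this only looks attractive for small automata.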

> Add TermAutomatonQuery, for proximity matching that generalizes 
> MultiPhraseQuery/SpanNearQuery
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5815
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5815
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.10
>
>         Attachments: LUCENE-5815.patch
>
>
> I created a new query, called TermAutomatonQuery, that's a proximity
> query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
> construct an arbitrary automaton whose transitions are whole terms, and
> then find all documents that the automaton matches.  This is different
> from a "normal" automaton whose transitions are usually
> bytes or characters within a term.
> So, if the automaton has just 1 transition, it's just an expensive
> TermQuery.  If you have two transitions in sequence, it's a phrase
> query of two terms.  You can express synonyms by using transitions
> that overlap one another but the automaton doesn't have to be a
> "sausage" (as MultiPhraseQuery requires) i.e. it "respects" posLength
> (at query time).
> It also allows "any" transitions, to match any term, so you can do
> sloppy matching and span-like queries, e.g. find "lucene" and "python"
> with up to 3 other terms in between.
> I also added a class to convert a TokenStream directly to the
> automaton for this query, preserving posLength.  (Of course, the index
> can't store posLength, so the matching won't be fully correct if any
> indexed token has posLength != 1).  But if you do query-time-only
> synonyms then the matching should finally be correct.
> I haven't tested performance but I suspect it's quite slowish ... its
> cost is O(sum-totalTF) of all terms "used" in the automaton.  There
> are some optimizations we could do, e.g. detecting that some terms in
> the automaton can be upgraded to MUST (right now they are all
> effectively SHOULD).
> I'm not sure how it should assign scores (punted on that for now), but
> the matching seems to be working.
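
As a rough illustration of the "any" transitions described above, here is a 
toy matcher that runs a term-level automaton over an in-memory token list. It 
is only a sketch: ToyTermAutomaton, its ANY label, and the near() helper are 
hypothetical names, it does not use postings the way the real query would, 
and it treats the "lucene ... python" example as ordered (lucene first).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy automaton whose transitions are whole terms (hypothetical; not Lucene). */
final class ToyTermAutomaton {
    /** Hypothetical wildcard label: matches any single term. */
    static final String ANY = "*";

    private static final class Transition {
        final int source;
        final String term;
        final int dest;

        Transition(int source, String term, int dest) {
            this.source = source;
            this.term = term;
            this.dest = dest;
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private final Set<Integer> accepts = new HashSet<>();

    void addTransition(int source, int dest, String term) {
        transitions.add(new Transition(source, term, dest));
    }

    void setAccept(int state) {
        accepts.add(state);
    }

    /** Builds: first, then 0..maxGap arbitrary terms, then last. */
    static ToyTermAutomaton near(String first, String last, int maxGap) {
        ToyTermAutomaton a = new ToyTermAutomaton();
        int accept = maxGap + 2;
        a.addTransition(0, 1, first);
        for (int gap = 0; gap <= maxGap; gap++) {
            int state = 1 + gap;
            a.addTransition(state, accept, last);
            if (gap < maxGap) {
                a.addTransition(state, state + 1, ANY);
            }
        }
        a.setAccept(accept);
        return a;
    }

    /** True if some contiguous run of tokens leads from state 0 to an accept state. */
    boolean matches(List<String> tokens) {
        Set<Integer> active = new HashSet<>();
        for (String token : tokens) {
            active.add(0); // a match may start at any position
            Set<Integer> next = new HashSet<>();
            for (Transition t : transitions) {
                boolean labelOk = t.term.equals(ANY) || t.term.equals(token);
                if (labelOk && active.contains(t.source)) {
                    if (accepts.contains(t.dest)) {
                        return true;
                    }
                    next.add(t.dest);
                }
            }
            active = next;
        }
        return false;
    }

    public static void main(String[] args) {
        ToyTermAutomaton a = near("lucene", "python", 3);
        // two terms in between: prints true
        System.out.println(a.matches(List.of("lucene", "has", "a", "python", "binding")));
        // four terms in between: prints false
        System.out.println(a.matches(List.of("lucene", "is", "written", "in", "java", "python")));
    }
}
```

The set of active states is advanced one token position at a time, which is 
the usual NFA simulation; it says nothing about how the actual patch 
enumerates positions or what it costs.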



--
This message was sent by Atlassian JIRA
(v6.2#6252)

