[jira] [Commented] (LUCENE-5815) Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery

Robert Muir (JIRA) Fri, 11 Jul 2014 05:55:37 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058730#comment-14058730
 ]


Robert Muir commented on LUCENE-5815:
-------------------------------------

{quote}
 (Of course, the index
can't store posLength, so the matching won't be fully correct if any
indexed token has posLength != 1).
{quote}

Why not? Can't we just have a TokenFilter that encodes PositionLengthAttribute 
as a vInt payload (it will always be one byte, unless you are crazy)? The custom 
scorer here could optionally support it.
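
A minimal sketch of the encoding side of that idea, in Python rather than as an actual Lucene TokenFilter. The function names are hypothetical and exist only for illustration; the byte layout follows the usual vInt scheme (7 data bits per byte, low-order group first, high bit set on every byte except the last), which is why any posLength below 128 fits in a single byte.

```python
def encode_vint(value: int) -> bytes:
    """Encode a non-negative int as a vInt-style payload (hypothetical helper)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    # Emit 7 bits at a time, low-order group first; the high bit marks
    # "more bytes follow".
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_vint(data: bytes) -> int:
    """Decode a payload produced by encode_vint (hypothetical helper)."""
    result = 0
    shift = 0
    for b in data:
        result |= (b & 0x7F) << shift
        if b & 0x80 == 0:
            return result
        shift += 7
    raise ValueError("truncated vInt")


# A scorer that wanted to honor an indexed posLength would decode the payload
# at each position and fall back to 1 when no payload is present.
pos_length_payload = encode_vint(2)
```

With this layout, a posLength of 1 or 2 costs one payload byte per position, and only values of 128 or more need a second byte.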

Personally: not sure it's worth it. I think it's better to fix the query parser 
(QP) to parse correctly in common cases like word-delimiter etc. (first: those 
tokenfilters must be fixed!).

And I'm a little confused: is this approach faster than rewrite() to booleans 
of phrase queries?
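
For reference, a rough Python sketch of what such a rewrite would have to enumerate. It assumes an acyclic automaton with no "any" transitions, represented as a plain dict (a hypothetical layout, not the patch's data structure); each yielded tuple would presumably become one phrase clause under a boolean OR.

```python
def enumerate_phrases(transitions, start, accept):
    """Yield every term sequence accepted by an acyclic term automaton.

    transitions: dict mapping state -> list of (term, next_state) pairs.
    This is an illustrative layout, not Lucene's Automaton class.
    """
    stack = [(start, ())]
    while stack:
        state, path = stack.pop()
        if state in accept and path:
            yield path
        for term, dest in transitions.get(state, ()):
            stack.append((dest, path + (term,)))


# Query-time synonym where one side spans two positions:
# "wifi network" vs. "wi fi network" (not a "sausage").
synonym_automaton = {
    0: [("wifi", 2), ("wi", 1)],
    1: [("fi", 2)],
    2: [("network", 3)],
}
phrases = set(enumerate_phrases(synonym_automaton, 0, {3}))
```

The number of enumerated phrases can grow multiplicatively with each position that has alternatives, which is one reason the performance comparison is not obvious either way.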

> Add TermAutomatonQuery, for proximity matching that generalizes 
> MultiPhraseQuery/SpanNearQuery
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5815
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5815
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.10
>
>         Attachments: LUCENE-5815.patch
>
>
> I created a new query, called TermAutomatonQuery, that's a proximity
> query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
> construct an arbitrary automaton whose transitions are whole terms, and
> then find all documents that the automaton matches.  This is different
> from a "normal" automaton whose transitions are usually
> bytes/characters within a term.
>
> So, if the automaton has just 1 transition, it's just an expensive
> TermQuery.  If you have two transitions in sequence, it's a phrase
> query of two terms.  You can express synonyms by using transitions
> that overlap one another but the automaton doesn't have to be a
> "sausage" (as MultiPhraseQuery requires) i.e. it "respects" posLength
> (at query time).
>
> It also allows "any" transitions, to match any term, so you can do
> sloppy matching and span-like queries, e.g. find "lucene" and "python"
> with up to 3 other terms in between.
>
> I also added a class to convert a TokenStream directly to the
> automaton for this query, preserving posLength.  (Of course, the index
> can't store posLength, so the matching won't be fully correct if any
> indexed token has posLength != 1).  But if you do query-time-only
> synonyms then the matching should finally be correct.
>
> I haven't tested performance but I suspect it's quite slowish ... its
> cost is O(sum-totalTF) of all terms "used" in the automaton.  There
> are some optimizations we could do, e.g. detecting that some terms in
> the automaton can be upgraded to MUST (right now they are all
> effectively SHOULD).
>
> I'm not sure how it should assign scores (punted on that for now), but
> the matching seems to be working.
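
To make the "any" transition in the quoted description concrete, here is a small Python sketch of matching a term-level automaton against one document's token sequence. It is not the patch's implementation: the dict layout, the ANY marker, and the helper names are assumptions made for illustration, and every indexed token is assumed to have posLength 1.

```python
ANY = object()  # hypothetical marker for a transition that matches any term


def near(first, second, max_gap):
    """Build an automaton for `first`, then up to max_gap other terms, then `second`."""
    accept_state = max_gap + 2
    transitions = {0: [(first, 1)]}
    for gap in range(max_gap + 1):
        state = 1 + gap
        edges = [(second, accept_state)]
        if gap < max_gap:
            edges.append((ANY, state + 1))  # consume one in-between term
        transitions[state] = edges
    return transitions, 0, {accept_state}


def matches(tokens, transitions, start, accept):
    """True if some contiguous run of tokens drives the automaton to an accept state."""
    for begin in range(len(tokens)):
        states = {start}
        for term in tokens[begin:]:
            # Advance every live state over the current term (NFA simulation).
            states = {
                dest
                for state in states
                for label, dest in transitions.get(state, ())
                if label is ANY or label == term
            }
            if not states:
                break
            if states & accept:
                return True
    return False


automaton = near("lucene", "python", 3)
hit = matches("lucene has a fast python".split(), *automaton)
miss = matches("lucene is a very fast python".split(), *automaton)
```

The first document has three terms between "lucene" and "python" and matches; the second has four and does not. This sketch scans whole token lists, whereas a real scorer would walk postings, which is where the O(sum-totalTF) cost mentioned above comes from.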



