[ https://issues.apache.org/jira/browse/LUCENE-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14058749#comment-14058749 ]
Michael McCandless commented on LUCENE-5815:
--------------------------------------------
bq. Can't we just have a tokenfilter that encodes positionLengthAttribute as a
vInt payload (will always be one byte, unless you are crazy)? The custom scorer
here could optionally support it.
Yes, I think that would work! It would be pretty simple to build, and the
changes to this scorer would be simple too: right now it hardwires that a
given token goes from pos to pos+1, but with this vInt in the payload it would
decode the position length and use that instead of +1.
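As a rough sketch of that idea (not part of the attached patch): the position
length could be written as a vInt, with 7 data bits per byte and the high bit
as a continuation flag, and the scorer would fall back to +1 when a token has
no payload. The class and method names below are hypothetical, and this is
plain Java rather than Lucene's TokenFilter/PayloadAttribute APIs.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, for illustration only; not a Lucene class.
final class PosLengthPayload {

    // Encode a non-negative int as a vInt: low-order 7 bits first,
    // high bit set on every byte except the last.
    static byte[] encodeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode a vInt produced by encodeVInt.
    static int decodeVInt(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // last byte of the vInt
            }
            shift += 7;
        }
        return value;
    }

    // End position of a token: pos + decoded posLength when a payload is
    // present, otherwise the currently hardwired pos + 1.
    static int endPosition(int pos, byte[] payload) {
        int posLength = (payload == null || payload.length == 0)
            ? 1
            : decodeVInt(payload);
        return pos + posLength;
    }
}
```

Any position length below 128 fits in a single byte, which is why the payload
would be one byte in practice.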
bq. I think it's better to fix QP to parse correctly in common cases like
word-delimiter etc (first: those tokenfilters must be fixed!).
Right, QP needs to use posLength to build the correct queries... this new query
just makes it "easy" since any arbitrary graph TokenStream can be directly
translated into the equivalent query.
bq. And I'm a little confused if this approach is faster than rewrite() to
booleans of phrase queries?
We can only rewrite to BQ of PQ if the automaton doesn't use the ANY token
transition, and if it's finite, right? Or maybe we could somehow take ANY and
map it to slop on the phrase queries?
But in those restricted cases it's probably faster, I guess, depending on what
the automaton looks like. I.e., you could make a biggish automaton that
rewrites to a very large number of phrase queries. I'll add a TODO to maybe do
this rewriting for this query ...
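To make the rewriting idea concrete, here is a minimal sketch, assuming a
finite (acyclic) automaton with no ANY transitions: every path from the start
state to an accept state is one phrase, and the rewritten query would be a
disjunction of those phrases. The TermAutomatonPaths class is hypothetical and
does not use Lucene's Automaton or PhraseQuery classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a term automaton; state 0 is the start state.
final class TermAutomatonPaths {

    // A transition labelled with a whole term, leading to state dest.
    static final class Transition {
        final String term;
        final int dest;

        Transition(String term, int dest) {
            this.term = term;
            this.dest = dest;
        }
    }

    private final Map<Integer, List<Transition>> transitions = new HashMap<>();
    private final Set<Integer> accept = new HashSet<>();

    void addTransition(int source, String term, int dest) {
        transitions.computeIfAbsent(source, k -> new ArrayList<>())
                   .add(new Transition(term, dest));
    }

    void setAccept(int state) {
        accept.add(state);
    }

    // Enumerate every term sequence leading from state 0 to an accept state.
    // Assumes the automaton is acyclic; a cycle would recurse forever.
    List<List<String>> enumeratePhrases() {
        List<List<String>> phrases = new ArrayList<>();
        walk(0, new ArrayList<>(), phrases);
        return phrases;
    }

    private void walk(int state, List<String> prefix, List<List<String>> phrases) {
        if (accept.contains(state) && !prefix.isEmpty()) {
            phrases.add(new ArrayList<>(prefix));
        }
        List<Transition> outgoing =
            transitions.getOrDefault(state, Collections.<Transition>emptyList());
        for (Transition t : outgoing) {
            prefix.add(t.term);
            walk(t.dest, prefix, phrases);
            prefix.remove(prefix.size() - 1); // backtrack
        }
    }
}
```

For an automaton where "wifi" (0 to 2) overlaps "wi fi" (0 to 1 to 2) and is
followed by "network" (2 to 3), this yields the two phrases "wifi network" and
"wi fi network". The number of phrases multiplies with each independent
branch, which is how a biggish automaton can expand to a very large number of
phrase queries.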
> Add TermAutomatonQuery, for proximity matching that generalizes
> MultiPhraseQuery/SpanNearQuery
> ----------------------------------------------------------------------------------------------
>
> Key: LUCENE-5815
> URL: https://issues.apache.org/jira/browse/LUCENE-5815
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 4.10
>
> Attachments: LUCENE-5815.patch
>
>
> I created a new query, called TermAutomatonQuery, that's a proximity
> query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
> construct an arbitrary automaton whose transitions are whole terms, and
> then find all documents that the automaton matches. This is different
> from a "normal" automaton whose transitions are usually
> bytes/characters within a term.
> So, if the automaton has just 1 transition, it's just an expensive
> TermQuery. If you have two transitions in sequence, it's a phrase
> query of two terms. You can express synonyms by using transitions
> that overlap one another but the automaton doesn't have to be a
> "sausage" (as MultiPhraseQuery requires) i.e. it "respects" posLength
> (at query time).
> It also allows "any" transitions, to match any term, so you can do
> sloppy matching and span-like queries, e.g. find "lucene" and "python"
> with up to 3 other terms in between.
> I also added a class to convert a TokenStream directly to the
> automaton for this query, preserving posLength. (Of course, the index
> can't store posLength, so the matching won't be fully correct if any
> indexed token has posLength != 1). But if you do query-time-only
> synonyms then the matching should finally be correct.
> I haven't tested performance but I suspect it's quite slowish ... its
> cost is O(sum-totalTF) of all terms "used" in the automaton. There
> are some optimizations we could do, e.g. detecting that some terms in
> the automaton can be upgraded to MUST (right now they are all
> effectively SHOULD).
> I'm not sure how it should assign scores (punted on that for now), but
> the matching seems to be working.
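To illustrate the "any" transitions from the description above, here is a
minimal sketch of matching "lucene" followed by "python" with up to 3 other
terms in between. It is a naive simulation over a single document's token
array, assuming every indexed token has posLength 1; it is not how the patch's
scorer works. The AnyTransitionMatcher class and the ANY marker are
hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified term automaton matcher; state 0 is the start state.
final class AnyTransitionMatcher {
    static final String ANY = "*"; // hypothetical marker for "match any term"

    static final class Transition {
        final int source;
        final String term;
        final int dest;

        Transition(int source, String term, int dest) {
            this.source = source;
            this.term = term;
            this.dest = dest;
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private final Set<Integer> accept = new HashSet<>();

    void addTransition(int source, String term, int dest) {
        transitions.add(new Transition(source, term, dest));
    }

    void setAccept(int state) {
        accept.add(state);
    }

    // "lucene", then 0 to 3 arbitrary terms, then "python".
    static AnyTransitionMatcher luceneNearPython() {
        AnyTransitionMatcher m = new AnyTransitionMatcher();
        m.addTransition(0, "lucene", 1);
        for (int state = 1; state <= 4; state++) {
            m.addTransition(state, "python", 5);
            if (state < 4) {
                m.addTransition(state, ANY, state + 1);
            }
        }
        m.setAccept(5);
        return m;
    }

    // True if some contiguous run of tokens drives the automaton from the
    // start state to an accept state.
    boolean matches(String[] tokens) {
        for (int start = 0; start < tokens.length; start++) {
            Set<Integer> active = new HashSet<>();
            active.add(0);
            for (int pos = start; pos < tokens.length && !active.isEmpty(); pos++) {
                Set<Integer> next = new HashSet<>();
                for (Transition t : transitions) {
                    boolean termMatches = t.term.equals(ANY) || t.term.equals(tokens[pos]);
                    if (active.contains(t.source) && termMatches) {
                        next.add(t.dest);
                    }
                }
                for (int state : next) {
                    if (accept.contains(state)) {
                        return true;
                    }
                }
                active = next;
            }
        }
        return false;
    }
}
```

With this automaton, the tokens "lucene is great with python" match (3 terms
in between), while "lucene a b c d python" does not (4 terms in between).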