[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

Michael McCandless (JIRA) Tue, 17 Jan 2017 03:32:53 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825894#comment-15825894
 ]


Michael McCandless commented on LUCENE-7638:
--------------------------------------------

This is an impressive patch!  Thanks [~jim.ferenczi].

I agree the combinatoric explosion is a serious problem/vulnerability/adversary 
and I love how this patch solves it.

Can this handle "side paths on side paths" graph structure (I think you called 
this "nested multi-term synonyms")?  While no analysis chain can naturally 
produce this today (that I know of!), the {{TokenStream}} attributes can easily 
express it.  And you could imagine it happening in the future, e.g. if you use 
Kuromoji tokenizer or {{WordDelimiterGraphFilter}} followed by a 
{{SynonymGraphFilter}} (except we'd need to e.g. use the synonym graph filter 
from LUCENE-5012, which can correctly consume a graph).  If this is expected to 
work maybe we should add a test case showing that?

It seems like you don't need to be using {{Term}} here, except at the end to 
pass to the {{newXXXQuery}}, since everything is in a single field here, and we 
are hoping to move away from {{Term}} entirely (LUCENE-7632)?

Holes are challenging for graph token streams ... can you add a test case that 
encounters holes, e.g. simulated {{StopFilter}}?  There are at least two "fun" 
cases: a hole that cuts the graph entirely into two partitions, and a synonym 
spanning over a hole ... {{CannedTokenStream}} is useful for feeding such 
"interesting" cases.

The {{Path.id}} seems to be used only for tie-breaking on compare, not for 
lookup in the {{TreeSet}} as the comment suggests?

On {{minShouldMatch}}: I feel it's actually more correct to define its 
semantics as operating on whole synonyms, not the individual tokens in each 
synonym, as this patch does.  Really, had you done the same synonym reduction 
at indexing time instead, so that "new york" in a document caused "ny" to be 
indexed as well, and then searched on "ny", this is the behavior you would see 
("new york" counts as the 1 {{minShouldMatch}}).

Of course, in general I think we are in new (to Lucene) territory here, on how 
graph structured queries should be matched/scored "properly", and I don't 
really know what the answer should be. {{TermAutomatonQuery}} [faces similar 
challenges|https://issues.apache.org/jira/browse/LUCENE-5815?focusedCommentId=14060665&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14060665].
  Maybe there are some IR papers out there that have studied this?

I even think it's odd that we don't use a {{PhraseQuery}} to match that 
syn-expanded "new york" even when the user didn't put double quotes around the 
query: had you done that syn at index time, it would "act like" a 
{{PhraseQuery}} ... but this is an unrelated issue!


> Optimize graph query produced by QueryBuilder
> ---------------------------------------------
>
>                 Key: LUCENE-7638
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7638
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>         Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

Reply via email to