[jira] [Updated] (LUCENE-7638) Optimize graph query produced by QueryBuilder

Jim Ferenczi (JIRA) Mon, 16 Jan 2017 06:50:41 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jim Ferenczi updated LUCENE-7638:
---------------------------------
    Description: 
The QueryBuilder creates a graph query when the underlying TokenStream contains 
a token with a PositionLengthAttribute greater than 1.
These TokenStreams are in fact graphs (lattices, to be more precise) in which 
a synonym can span multiple terms. 
Currently the graph query is built by visiting every path of the graph 
TokenStream. For instance, if you have a synonym like "ny, new york" and you 
search for "new york city", the query builder produces two paths:
"new york city", "ny city"
The number of paths can quickly explode as the number of multi-term synonyms 
increases. The query "ny ny", for instance, produces 4 paths, and so on.
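To make the path growth concrete, here is a minimal Python sketch of all-paths enumeration over a token graph. It is not the QueryBuilder code: the edge-tuple representation (start state, end state, term) and the name enumerate_paths are assumptions made for illustration only.

```python
from collections import defaultdict

def enumerate_paths(edges, start, end):
    """Return every term sequence leading from state `start` to state `end`.

    `edges` is a list of (src_state, dst_state, term) tuples; this layout is
    a hypothetical stand-in for a graph TokenStream, not a Lucene structure.
    """
    outgoing = defaultdict(list)
    for src, dst, term in edges:
        outgoing[src].append((dst, term))

    def walk(state):
        if state == end:
            return [[]]  # one empty tail: the path is complete
        paths = []
        for dst, term in outgoing[state]:
            for tail in walk(dst):
                paths.append([term] + tail)
        return paths

    return walk(start)

# "new york city" with the synonym rule "ny, new york":
# "ny" spans two positions, so its edge goes from state 0 to state 2.
new_york_city = [(0, 1, "new"), (1, 2, "york"), (0, 2, "ny"), (2, 3, "city")]

# "ny ny": each "ny" also expands to "new york".
ny_ny = [(0, 2, "ny"), (0, 1, "new"), (1, 2, "york"),
         (2, 4, "ny"), (2, 3, "new"), (3, 4, "york")]

print(enumerate_paths(new_york_city, 0, 3))
print(len(enumerate_paths(ny_ny, 0, 4)))
```

The first call yields the two paths "new york city" and "ny city"; the second counts 4 paths, because each independent two-way choice doubles the total.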
For boolean queries with should or must clauses it should be more efficient to 
build a boolean query that merges all the intersections in the graph. So 
instead of "new york city", "ny city" we could produce:
"+((+new +york) ny) +city"

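The merged form above can be sketched as follows. This is only an illustration of the idea, not the attached patch: it assumes integer states ordered by position, forward-only edges given as (start state, end state, term) tuples, must ("+") clauses throughout, and it treats an "intersection" as a state that no edge jumps over. The names enumerate_paths and merged_boolean_query are hypothetical.

```python
from collections import defaultdict

def enumerate_paths(edges, start, end):
    """Return every term sequence from state `start` to state `end`."""
    outgoing = defaultdict(list)
    for src, dst, term in edges:
        outgoing[src].append((dst, term))

    def walk(state):
        if state == end:
            return [[]]
        return [[term] + tail
                for dst, term in outgoing[state]
                for tail in walk(dst)]

    return walk(start)

def merged_boolean_query(edges):
    """Render one required clause per segment between intersections."""
    states = sorted({s for src, dst, _ in edges for s in (src, dst)})
    # Assumption: a state is an intersection when no edge jumps over it,
    # so every path through the graph has to visit it.
    cuts = [s for s in states
            if not any(src < s < dst for src, dst, _ in edges)]
    clauses = []
    for a, b in zip(cuts, cuts[1:]):
        # Only enumerate paths locally, inside the segment [a, b].
        segment = [e for e in edges if a <= e[0] and e[1] <= b]
        paths = enumerate_paths(segment, a, b)
        if len(paths) == 1:
            clauses.extend("+" + term for term in paths[0])
        else:
            alternatives = [
                path[0] if len(path) == 1
                else "(" + " ".join("+" + t for t in path) + ")"
                for path in paths
            ]
            clauses.append("+(" + " ".join(alternatives) + ")")
    return " ".join(clauses)

new_york_city = [(0, 1, "new"), (1, 2, "york"), (0, 2, "ny"), (2, 3, "city")]
ny_ny = [(0, 2, "ny"), (0, 1, "new"), (1, 2, "york"),
         (2, 4, "ny"), (2, 3, "new"), (3, 4, "york")]

print(merged_boolean_query(new_york_city))
print(merged_boolean_query(ny_ny))
```

For "new york city" this prints +((+new +york) ny) +city. For "ny ny" it prints two required clauses, each holding two alternatives, instead of 4 full paths: the number of clauses grows with the number of synonym sites rather than with the number of path combinations.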
The attached patch is a proposal to do that instead of the all-paths solution.
The patch transforms multi-term synonyms into a graph query for each 
intersection in the graph. This is not done in this patch, but we could also 
create a specialized query that gives equivalent scores to multi-term synonyms, 
as SynonymQuery does for single-term synonyms.
For phrase queries this patch does not change the current behavior, but we 
could also use the new method to create an optimized graph SpanQuery.

[~mattweber] I think this patch could optimize a lot of cases where multiple 
multi-term synonyms are present in a single request. Could you take a look?

  was:
The QueryBuilder now creates a graph query when the underlying TokenStream 
contains a token with a PositionLengthAttribute greater than 1.
These TokenStreams are in fact graphs (lattices, to be more precise) in which 
a synonym can span multiple terms. 
Currently the graph query is built by visiting every path of the graph 
TokenStream. For instance, if you have a synonym like "ny, new york" and you 
search for "new york city", the query builder produces two paths:
"new york city", "ny city"
The number of paths can quickly explode as the number of multi-term synonyms 
increases. The query "ny ny", for instance, produces 4 paths, and so on.
For boolean queries with should or must clauses it should be more efficient to 
build a boolean query that merges all the intersections in the graph. So 
instead of "new york city", "ny city" we could produce:
"+((+new +york) ny) +city"

The attached patch is a proposal to do that instead of the all-paths solution.
The patch transforms multi-term synonyms into a graph query for each 
intersection in the graph. This is not done in this patch, but we could also 
create a specialized query that gives equivalent scores to multi-term synonyms, 
as SynonymQuery does for single-term synonyms.
For phrase queries this patch does not change the current behavior, but we 
could also use the new method to create an optimized graph SpanQuery.

[~mattweber] I think this patch could optimize a lot of cases where multiple 
multi-term synonyms are present in a single request. Could you take a look?


> Optimize graph query produced by QueryBuilder
> ---------------------------------------------
>
>                 Key: LUCENE-7638
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7638
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>         Attachments: LUCENE-7638.patch



