[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630529#comment-16630529 ]

Michael Gibney edited comment on LUCENE-7398 at 9/27/18 3:01 PM:
-----------------------------------------------------------------

I have a branch containing a candidate fix for this issue: 
[LUCENE-7398/master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master]

It includes support for complete graph-based matching, configurable to include:
 # all valid top-level {{startPosition}} s
 # all valid match lengths (in the {{startPosition - endPosition}} sense)
 # all valid match {{width}} s (in the slop sense)
 # all redundant matches (different {{Term}} s, same {{startPosition}}, 
{{endPosition}}, and {{width}})
 # all possible valid combinations of subclause positions

Option 1 is appropriate for top-level matching and document matching (and is 
complete for that use case). Options 2 and 3 may be used in subclauses to 
guarantee complete matching of parent {{Spans}}, and option 4 results in very 
thorough scoring. Option 5 would be an unusual use case, but I think there are 
some applications for full combinatoric matching, and the implementation 
supported it well, so it is included for the sake of completeness.
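
To illustrate why enumerating all match lengths in subclauses (option 2) matters for the parent query, here is a deliberately simplified Java model built on the document and query from the issue description. It is a sketch only, not Lucene's {{Spans}} API and not the implementation in the branch: the class {{NestedSpanSketch}}, its methods, the {{[start, end)}} span arrays, and the lowercase whitespace tokenization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model (hypothetical, not Lucene code): a span is an int[]{start, end}
// covering token positions [start, end).
final class NestedSpanSketch {

    /** All spans of a single term in the document. */
    static List<int[]> term(String[] doc, String t) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < doc.length; i++) {
            if (doc[i].equals(t)) out.add(new int[] {i, i + 1});
        }
        return out;
    }

    /** Union of the alternatives, sorted by start position, then end position. */
    static List<int[]> or(List<List<int[]>> alternatives) {
        List<int[]> out = new ArrayList<>();
        for (List<int[]> alt : alternatives) out.addAll(alt);
        out.sort((a, b) -> a[0] != b[0] ? Integer.compare(a[0], b[0])
                                        : Integer.compare(a[1], b[1]));
        return out;
    }

    /** Keeps only the shortest span per start position (input must be sorted as in or()). */
    static List<int[]> shortestPerStart(List<int[]> spans) {
        List<int[]> out = new ArrayList<>();
        for (int[] s : spans) {
            if (out.isEmpty() || out.get(out.size() - 1)[0] != s[0]) out.add(s);
        }
        return out;
    }

    /** Ordered near with slop 0: each clause must start exactly where the previous one ended. */
    static List<int[]> nearOrdered(List<List<int[]>> clauses) {
        List<int[]> partial = new ArrayList<>(clauses.get(0));
        for (int c = 1; c < clauses.size(); c++) {
            List<int[]> next = new ArrayList<>();
            for (int[] p : partial) {
                for (int[] s : clauses.get(c)) {
                    if (s[0] == p[1]) next.add(new int[] {p[0], s[1]});
                }
            }
            partial = next;
        }
        return partial;
    }

    /** Evaluates the query from the issue; allLengths toggles complete subclause enumeration. */
    static List<int[]> match(String[] doc, boolean allLengths) {
        List<int[]> gene = term(doc, "gene");
        List<int[]> geneMapping = nearOrdered(Arrays.asList(gene, term(doc, "mapping")));
        List<int[]> middle = or(Arrays.asList(geneMapping, gene));
        if (!allLengths) middle = shortestPerStart(middle);
        return nearOrdered(Arrays.asList(term(doc, "coordinate"), middle, term(doc, "research")));
    }

    public static void main(String[] args) {
        String[] doc = ("human genome organization hugo is trying to coordinate "
                + "gene mapping research worldwide").split(" ");
        System.out.println("all lengths:   " + match(doc, true).size() + " match(es)");
        System.out.println("shortest only: " + match(doc, false).size() + " match(es)");
    }
}
```

Run as-is, this prints 1 match when all subclause lengths are enumerated and 0 matches in shortest-only mode. In the restricted mode the only span the {{spanOr}} subclause offers at the position of "gene" is the single term, so "research" is never found adjacent to it.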

The candidate implementation models the match graph as a kind of 2-dimensional 
queue that supports random-access seek and arbitrary node removal. A more 
thorough explanation would be unwieldy in a comment, so I wrote [three 
posts|https://michaelgibney.net/lucene/graph/], which respectively:
 # [Provide some 
background|https://michaelgibney.net/2018/09/lucene-graph-queries-1/] on the 
problem associated with LUCENE-7398 (this post is heavily informed by the 
discussion on this issue)
 # [Describe the candidate 
implementation|https://michaelgibney.net/2018/09/lucene-graph-queries-2/] in 
some detail (also includes information on how to configure/test/evaluate)
 # [Anticipate some possible 
consequences/applications|https://michaelgibney.net/2018/09/lucene-graph-queries-3/]
 of new functionality that would be enabled by this (or other equivalent) fix

Some notes:
 # The branch contains (and passes) all tests proposed so far in association 
with this issue (and also quite a few additional tests)
 # The candidate implementation is made more complete and performant by the 
addition of some extra information in the index (e.g., {{positionLength}}). 
This extra information is currently stored using {{Payload}} s, though for 
{{positionLength}} at least, there has been some discussion of integrating it 
more directly in the index (see LUCENE-4312, LUCENE-3843)
 # Some version of this code has been running in production for several months, 
and has given no indication of instability, even running every user phrase 
query (both explicit and {{pf}}) as a graph query.
 # To facilitate evaluation, the fix is integrated in 
[master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master], 
[branch_7x|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7x], 
[branch_7_5|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_5],
 and 
[branch_7_4|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_4].
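
As a rough illustration of what carrying {{positionLength}} in a per-position payload could look like, the sketch below encodes the value as a variable-length integer and decodes it again. The byte layout, the class name {{PositionLengthPayloadSketch}}, and both method names are assumptions; the payload format actually used in the branch is not described in this comment. The only fact relied on is that a token's default {{positionLength}} in Lucene is 1.

```java
import java.util.Arrays;

// Hypothetical payload codec, for illustration only; not the branch's format.
final class PositionLengthPayloadSketch {

    /** Encodes positionLength as a variable-length int: 7 bits per byte, low bits first. */
    static byte[] encode(int positionLength) {
        if (positionLength < 1) {
            throw new IllegalArgumentException("positionLength must be >= 1");
        }
        byte[] buf = new byte[5]; // an int needs at most 5 bytes in this encoding
        int n = 0;
        int v = positionLength;
        while ((v & ~0x7F) != 0) {
            buf[n++] = (byte) ((v & 0x7F) | 0x80); // high bit set: more bytes follow
            v >>>= 7;
        }
        buf[n++] = (byte) v;
        return Arrays.copyOf(buf, n);
    }

    /** Decodes a payload; a missing payload is treated as the default positionLength of 1. */
    static int decode(byte[] payload) {
        if (payload == null || payload.length == 0) return 1;
        int value = 0;
        int shift = 0;
        for (byte b : payload) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break; // high bit clear: last byte
            shift += 7;
        }
        return value;
    }
}
```

With this layout, typical multi-token synonyms (small {{positionLength}} values) cost a single payload byte, and tokens without a payload fall back to the default.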



> Nested Span Queries are buggy
> -----------------------------
>
>                 Key: LUCENE-7398
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7398
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 5.5
>            Reporter: Christoph Goller
>            Assignee: Alan Woodward
>            Priority: Critical
>         Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch, 
> LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1; it did, however, match with Lucene 4.10.4. 
> It probably stopped working with the changes to SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



