[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630529#comment-16630529
]
Michael Gibney edited comment on LUCENE-7398 at 9/27/18 3:01 PM:
-----------------------------------------------------------------
I have a branch containing a candidate fix for this issue:
[LUCENE-7398/master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master]
It includes support for complete graph-based matching, configurable to include:
# all valid top-level {{startPosition}} values
# all valid match lengths (in the {{endPosition - startPosition}} sense)
# all valid match {{width}} values (in the slop sense)
# all redundant matches (different {{Term}} instances, same {{startPosition}},
{{endPosition}}, and {{width}})
# all possible valid combinations of subclause positions
Option 1 is appropriate for top-level matching and document matching (and is
complete for that use case); options 2/3 may be used in subclauses to guarantee
complete matching of parent {{Spans}}; option 4 results in very thorough
scoring. Option 5 would be an unusual use case, but I think there are some
applications for full combinatoric matching, and the implementation supports
it well, so it is included for completeness.
The candidate implementation models the match graph as a kind of 2-dimensional
queue that supports random-access seek and arbitrary node removal. A more
thorough explanation would be unwieldy in a comment, so I wrote [three
posts|https://michaelgibney.net/lucene/graph/], which respectively:
# [Provide some
background|https://michaelgibney.net/2018/09/lucene-graph-queries-1/] on the
problem associated with LUCENE-7398 (this post is heavily informed by the
discussion on this issue)
# [Describe the candidate
implementation|https://michaelgibney.net/2018/09/lucene-graph-queries-2/] in
some detail (also includes information on how to configure/test/evaluate)
# [Anticipate some possible
consequences/applications|https://michaelgibney.net/2018/09/lucene-graph-queries-3/]
of new functionality that would be enabled by this (or other equivalent) fix
Some notes:
# The branch contains (and passes) all tests proposed so far in association
with this issue (and also quite a few additional tests)
# The candidate implementation is made more complete and performant by the
addition of some extra information in the index (e.g., {{positionLength}}).
This extra information is currently stored using {{Payload}} instances, though for
{{positionLength}} at least, there has been some discussion of integrating it
more directly in the index (see LUCENE-4312, LUCENE-3843)
# Some version of this code has been running in production for several months,
and has given no indication of instability, even running every user phrase
query (both explicit and {{pf}}) as a graph query.
# To facilitate evaluation, the fix is integrated in
[master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master],
[branch_7x|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7x],
[branch_7_5|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_5],
and
[branch_7_4|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_4].
> Nested Span Queries are buggy
> -----------------------------
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 5.5
> Reporter: Christoph Goller
> Assignee: Alan Woodward
> Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch,
> LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch,
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping],
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as
> "coordinate gene research". It does not match "coordinate gene mapping
> research" with Lucene 5.5 or 6.1, although it did match with Lucene 4.10.4.
> It probably stopped working with the changes to SpanQueries in 5.3. I will
> attach a unit test that shows the problem.
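
The failure described in the issue can be illustrated with a toy model. The sketch below is not Lucene code and calls no Lucene API: {{NestedSpanSketch}}, {{Span}}, {{greedyNear}}, {{exhaustiveNear}} and the other names are hypothetical, and the greedy strategy is only an assumed simplification of lazy ordered-near matching (take the first subclause span that starts at or after the previous clause's end, then check slop, without revisiting alternatives). The document is reduced to the four relevant tokens, and slop is fixed at 0 as in the query above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NestedSpanSketch {

    /** A match covering token positions [start, end). */
    static final class Span {
        final int start, end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** One single-position span per occurrence of the term. */
    static List<Span> term(String[] tokens, String t) {
        List<Span> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(t)) out.add(new Span(i, i + 1));
        }
        return out;
    }

    /** Ordered, slop-0 pairs: b must start exactly where a ends. */
    static List<Span> phrase(List<Span> a, List<Span> b) {
        List<Span> out = new ArrayList<>();
        for (Span x : a) {
            for (Span y : b) {
                if (y.start == x.end) out.add(new Span(x.start, y.end));
            }
        }
        return out;
    }

    /** Union of two span lists, ordered by start and then by end (assumed ordering). */
    static List<Span> or(List<Span> a, List<Span> b) {
        List<Span> out = new ArrayList<>(a);
        out.addAll(b);
        out.sort(Comparator.comparingInt((Span s) -> s.start).thenComparingInt(s -> s.end));
        return out;
    }

    /** Toy version of the query from the issue: coordinate, (gene mapping | gene), research. */
    static List<List<Span>> query(String[] tokens) {
        List<Span> gene = term(tokens, "gene");
        List<Span> geneMapping = phrase(gene, term(tokens, "mapping"));
        List<List<Span>> clauses = new ArrayList<>();
        clauses.add(term(tokens, "coordinate"));
        clauses.add(or(geneMapping, gene));
        clauses.add(term(tokens, "research"));
        return clauses;
    }

    /** Greedy: each later clause commits to its first span at or after prevEnd. */
    static boolean greedyNear(List<List<Span>> clauses) {
        for (Span first : clauses.get(0)) {
            int prevEnd = first.end;
            boolean ok = true;
            for (int c = 1; c < clauses.size() && ok; c++) {
                Span next = null;
                for (Span s : clauses.get(c)) {
                    if (s.start >= prevEnd) { next = s; break; }
                }
                // Slop 0: the chosen span must be adjacent; no alternative is tried.
                ok = next != null && next.start == prevEnd;
                if (ok) prevEnd = next.end;
            }
            if (ok) return true;
        }
        return false;
    }

    /** Exhaustive: backtracks over every adjacent span of every clause. */
    static boolean exhaustiveNear(List<List<Span>> clauses, int c, int prevEnd) {
        if (c == clauses.size()) return true;
        for (Span s : clauses.get(c)) {
            boolean adjacent = (c == 0) || s.start == prevEnd;
            if (adjacent && exhaustiveNear(clauses, c + 1, s.end)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] doc = "coordinate gene mapping research".split(" ");
        System.out.println("greedy: " + greedyNear(query(doc)));
        System.out.println("exhaustive: " + exhaustiveNear(query(doc), 0, 0));
    }
}
```

In this model the greedy matcher commits to the shorter "gene" span, finds that "research" is not adjacent to it, and gives up, whereas the exhaustive matcher also tries the longer "gene mapping" span and succeeds. For "coordinate gene research" both strategies find the match.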