[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2018-09-27 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630529#comment-16630529
 ] 

Michael Gibney edited comment on LUCENE-7398 at 9/27/18 3:01 PM:
-

I have a branch containing a candidate fix for this issue: 
[LUCENE-7398/master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master]

It includes support for complete graph-based matching, configurable to include:
 # all valid top-level {{startPosition}} s
 # all valid match lengths (in the {{startPosition - endPosition}} sense)
 # all valid match {{width}} s (in the slop sense)
 # all redundant matches (different {{Term}} s, same {{startPosition}}, 
{{endPosition}}, and {{width}})
 # all possible valid combinations of subclause positions

Option 1 is appropriate for top-level matching and document matching (and is 
complete for that use case); options 2/3 may be used in subclauses to guarantee 
complete matching of parent {{Spans}}; option 4 results in very thorough 
scoring. Option 5 would be an unusual use case; but I think there are some 
applications for full combinatoric matching, and the option was well supported 
by the implementation, so it is included for the sake of completeness.

The candidate implementation models the match graph as a kind of 2-dimensional 
queue that supports random-access seek and arbitrary node removal. A more 
thorough explanation would be unwieldy in a comment, so I wrote [three 
posts|https://michaelgibney.net/lucene/graph/], which respectively:
 # [Provide some 
background|https://michaelgibney.net/2018/09/lucene-graph-queries-1/] on the 
problem associated with LUCENE-7398 (this post is heavily informed by the 
discussion on this issue)
 # [Describe the candidate 
implementation|https://michaelgibney.net/2018/09/lucene-graph-queries-2/] in 
some detail (also includes information on how to configure/test/evaluate)
 # [Anticipate some possible 
consequences/applications|https://michaelgibney.net/2018/09/lucene-graph-queries-3/]
 of new functionality that would be enabled by this (or other equivalent) fix

Some notes:
 # The branch contains (and passes) all tests proposed so far in association 
with this issue (and also quite a few additional tests)
 # The candidate implementation is made more complete and performant by the 
addition of some extra information in the index (e.g., {{positionLength}}). 
This extra information is currently stored using {{Payload}} s, though for 
{{positionLength}} at least, there has been some discussion of integrating it 
more directly in the index (see LUCENE-4312, LUCENE-3843)
 # Some version of this code has been running in production for several months, 
and has given no indication of instability, even running every user phrase 
query (both explicit and {{pf}}) as a graph query.
 # To facilitate evaluation, the fix is integrated in 
[master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master], 
[branch_7x|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7x], 
[branch_7_5|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_5],
 and 
[branch_7_4|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_4].


was (Author: mgibney):
I have a branch containing a candidate fix for this issue: 
[LUCENE-7398/master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master]

It includes support for complete graph-based matching, configurable to include:
 # all valid top-level {{startPosition}} s
 # all valid match lengths (in the {{startPosition - endPosition}} sense)
 # all valid match {{width}} s (in the slop sense)
 # all redundant matches (different {{Term}} s, same {{startPosition}}, 
{{endPosition}}, and {{width}})
 # all possible valid combinations of subclause positions

Option 1 is appropriate for top-level matching and document matching (and is 
complete for that use case); options 2/3 may be used in subclauses to guarantee 
complete matching of parent {{Spans}}; option 4 results in very thorough 
scoring. Option 5 would be an unusual use case; but I think there are some 
applications for full combinatoric matching, and the option was well supported 
by the implementation, so it is included for the sake of completeness.

The candidate implementation models the match graph as a kind of 2-dimensional 
queue that supports random-access seek and arbitrary node removal. A more 
thorough explanation would be unwieldy in a comment, so I wrote [three 
posts|https://michaelgibney.net/lucene/graph/], which respectively:
 # [Provide some 
background|https://michaelgibney.net/2018/09/lucene-graph-queries-1/] on the 
problem associated with LUCENE-7398 (this post is heavily informed by the 
discussion on this issue)
 # [Describe the candidate 
implementation|https://michaelgibney.net/2018/09/lucene-graph-queries-2/] in 
some detail (also includes information on how to configure/test/evaluate)
 # 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2017-03-02 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892972#comment-15892972
 ] 

Paul Elschot edited comment on LUCENE-7398 at 3/2/17 8:54 PM:
--

One way to view the problem is that when span end positions are used to 
determine the slop, it becomes impossible to determine an order for moving the 
subspans to a next position.

So one direction out of this could be: use NearSpans that determines the slop 
only by the start positions of the subspans. That leaves only the cases in 
which the subspans can start (and maybe also end) at the same position.
To make sure that all the subspans move forward after a match we could move 
them all forward until after the current match, and while doing that also 
count/collect them for scoring/highlighting as long as they are within the 
match. That should solve the bug reported here, which is about scoring a missed 
matching occurrence.

This limits the required slop to using only the starting positions of the 
subspans. Could this work?



was (Author: paul.elsc...@xs4all.nl):
On way to view the problem is that when span end positions are used to 
determine the slop, it becomes impossible to determine an order for moving the 
subspans to a next position.

So one direction out of this could be: use NearSpans that determines the slop 
only by the start positions of the subspans. That leaves only the cases in 
which the subspans can start (and maybe also end) at the same position.
To make sure that all the subspans move forward after a match we could move 
them all forward until after the current match, and while doing that also 
count/collect them for scoring/highlighting as long as they are within the 
match. That should solve the bug reported here, which is about scoring a missed 
matching occurrence.

This limits the required slop to using only the starting positions of the 
subspans. Could this work?


> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch, 
> LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-25 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519320#comment-15519320
 ] 

Paul Elschot edited comment on LUCENE-7398 at 9/25/16 10:00 PM:


Patch of 24 Sep 2016, work in progress. Edit: superseded on 25 Sep, this can be 
ignored.

This introduces SpanNearQuery.MatchNear to choose the matching method.

The ORDERED_LAZY case is still the patch of 14 August, this should be changed 
back to the current implementation, and be used to implement ORDERED_LOOKAHEAD.

This implements MatchNear.UNORDERED_STARTPOS and uses that as the default 
implementation for the unordered case.
The implementation of UNORDERED_STARTPOS is in NearSpansUnorderedStartPos, 
which is simpler than the current NearSpansUnordered, there is no SpansCell.
I'd expect this StartPos implementation to be a little faster, so I also 
implemented it as default for the unordered case.  In only one test case the 
UNORDERED_LAZY method is needed to pass the test.

The question is whether it is ok to change the default unordered implementation 
to only use the span start positions.

The collect() method is moved to the superclass ConjunctionSpans, this 
simplification might be done at another issue.


was (Author: paul.elsc...@xs4all.nl):
Patch of 24 Sep 2016, work in progress.

This introduces SpanNearQuery.MatchNear to choose the matching method.

The ORDERED_LAZY case is still the patch of 14 August, this should be changed 
back to the current implementation, and be used to implement ORDERED_LOOKAHEAD.

This implements MatchNear.UNORDERED_STARTPOS and uses that as the default 
implementation for the unordered case.
The implementation of UNORDERED_STARTPOS is in NearSpansUnorderedStartPos, 
which is simpler than the current NearSpansUnordered, there is no SpansCell.
I'd expect this StartPos implementation to be a little faster, so I also 
implemented it as default for the unordered case.  In only one test case the 
UNORDERED_LAZY method is needed to pass the test.

The question is whether it is ok to change the default unordered implementation 
to only use the span start positions.

The collect() method is moved to the superclass ConjunctionSpans, this 
simplification might be done at another issue.

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch, 
> LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch, 
> TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-19 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504686#comment-15504686
 ] 

Paul Elschot edited comment on LUCENE-7398 at 9/19/16 9:22 PM:
---

The idea is to allow full backward compatibility, as well as more matching 
methods:

UNORDERED_LAZY is the current unordered,
UNORDERED_STARTPOS is even simpler, it only uses span start positions, so it 
should be complete.
ORDERED_LAZY is the current ordered,
ORDERED_LOOKAHEAD is in the patch of 14 August 2016,
ORDERED_STARTPOS also only uses start positions, so it should be complete.

The complete ORDERED and UNORDERED cases that use start and end positions and 
need backtracking are left for later.

Comments?


was (Author: paul.elsc...@xs4all.nl):
The idea is to allow full backward compatibility, as well as more matching 
methods:

UNORDERED_LAZY is the current unordered,
UNORDERED_STARTPOS is even simpler, it only uses span start positions, so it 
should be complete.
ORDERED_LAZY is the current ordered,
ORDERED_LOOKAHEAD is in the patch of 14 August 2016,
ORDERED_STARTPOS is also only uses start positions, so it should be complete.

The complete ORDERED and UNORDERED cases that use start and end positions and 
need backtracking are left for later.

Comments?

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-19 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504663#comment-15504663
 ] 

Paul Elschot edited comment on LUCENE-7398 at 9/19/16 8:59 PM:
---

I have started on working on a SpanNearQuery that contains this:
{code}
  /** Specifies how clauses are to occur near each other in matching documents. 
*/
  public static enum MatchNear {

/** Use this method for clauses that match when they are not ordered,
 * and the slop should be determined between the end and start positions of 
all clauses.
 * When the subspans vary in length, some matches may not be found.
 */
UNORDERED_LAZY,

/** Use this method for clauses that match when they are not ordered,
 * and the slop should be determined between the start positions of the 
first and last matching clauses.
 */
UNORDERED_STARTPOS,

/** Use this method for clauses that can match when they are ordered and 
span collection is needed,
 * and the slop should be determined between the end and start positions of 
the clauses.
 * When the subspans vary in length, some matches may not be found.
 */
ORDERED_LAZY,

/** Use this method for clauses that can match when they are ordered and 
span collection is needed,
 * and the slop should be determined between the end and start positions of 
the clauses.
 * When the subspans vary in length, some matches may not be found,
 * however this method finds more matches than {@link ORDERED_LAZY}.
 */
ORDERED_LOOKAHEAD,

/** Use this method for clauses that match when they are ordered,
 * and the slop should be determined between the start positions of the 
first and last matching clauses.
 */
ORDERED_STARTPOS
  }

{code}



was (Author: paul.elsc...@xs4all.nl):
I have started on working on a SpanNearQuery that contains this:
{code}
  /** Specifies how clauses are to occur near each other in matching documents. 
*/
  public static enum MatchNear {

/** Use this method for clauses that match when they are not ordered,
 * and the slop should be determined between the end and start positions of 
all clauses.
 * When the subspans vary in length, some matches may not be found.
 */
UNORDERED,

/** Use this method for clauses that match when they are not ordered,
 * and the slop should be determined between the start positions of the 
first and last matching clauses.
 */
UNORDERED_STARTPOS,

/** Use this method for clauses that can match when they are ordered and 
span collection is needed,
 * and the slop should be determined between the end and start positions of 
the clauses.
 * When the subspans vary in length, some matches may not be found.
 */
ORDERED_LAZY,

/** Use this method for clauses that can match when they are ordered and 
span collection is needed,
 * and the slop should be determined between the end and start positions of 
the clauses.
 * When the subspans vary in length, some matches may not be found,
 * however this method finds more matches than {@link ORDERED_LAZY}.
 */
ORDERED_LOOKAHEAD,

/** Use this method for clauses that match when they are ordered,
 * and the slop should be determined between the start positions of the 
first and last matching clauses.
 */
ORDERED_STARTPOS
  }

{code}


> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-06 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466888#comment-15466888
 ] 

Christoph Goller edited comment on LUCENE-7398 at 9/6/16 3:37 PM:
--

After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is one. 

1) Missing Matches caused by lazy iteration:

I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I 
know there never was an implementation that compared all possible combinations 
of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed 
some matches.

*) This ticket demonstrates missing matches for ordered SpanQuery. Documents 
that should match don't match. This is caused by subspans of SpanNearQuery 
having a variable match length. For these cases the lazy iteration 
implementation which tries to optimize the number of comparisons of subspan 
matches is not sufficient.

*) Tim tried these examples with unorderd SpanQuery and got the same bahavior. 
I think this is caused by a similar kind of lazy iteration in the unordered 
case.

*) In the unordered case lazy iteration also causes problems if the subspans do 
not have variable-length matches. This is demonstrated in LUCENE-5331 and 
LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all 
clauses of the SpanNearQuery were SpanTermQueries, but some occured more than 
once. For PhraseQuery and MultiPhraseQuery and their implementation in 
SloppyPhraseScore this seems to be a known problem that has been solved by a 
special complex treatment of repetitions that I currently don't understand in 
detail.

My current opinion: We should give up lazy iteration for the unordered and the 
ordered case to solve these problems. I think it can be done and the 
performance peanalty should not be too big. We already iterate over all 
positions of all subspans. So we already have done the expensive operation of 
reading them. Should some more comparisons of int-values (positions) really 
matter so much? At least for the ordered case I am optimistic that I could 
implement it efficiently.

2) Inconsistent Scoring of SpanNearQuery

*) Lazy iteration means that some "redundant" matches in a document are skipped 
in order to have a faster matching algorithm. I am not sure how redundant was 
defined exactly for the idea of lazy iteration. It referred to matches with the 
same start posisiton somehow. As long as different matches for the first clause 
are concerned, they are found, but not the all matches for intermediate 
subclauses are regarded. Skipping matches however reduces the frequency that is 
computed and consequently the score. See Javadoc of phraseFreq() in 
SloppyPhraseScore which mention the same phenomenon. This is quite important 
for my use case of SpanQueries. I have different versions/variants of the same 
term on the same position, e.g. one with case-normalization and one without and 
I want a higher score if the user-query matches for more than one variant, and 
I use this approach for clauses of SpanNearQuery.

*) In NearSpansOrdered the method width() (it is used to compute sloppy 
frequency in SpanScore) returns the number of gaps between the matches. If you 
have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it 
returns the length of the match, not the number of gaps. See atMatch() for the 
difference. The reason is probably, that (maxEndPositionCell.endPosition() - 
minPositionCell().startPosition() - totalSpanLength) might even become negative 
if matches overlap. I would prefer something like Math.max(0, 
(maxEndPositionCell.endPosition() - minPositionCell().startPosition() - 
totalSpanLength))

*) SpanOrQuery and SpanNearQuery completely ignore the scores of their 
subclauses  (subweights are always generated as non-scoring). A SpanOrQuery 
should give a Score similar to a BooleanQuery, shouldn't it? As long as we have 
this behavior, SpanBoostQuery does not make any sense, doese it? So to my 
opinion the existance of SpanBoostQuery shows that others also had the idea 
that a nested SpanQuery should somehow use the scores of their clauses for the 
computation of their own score.



was (Author: gol...@detego-software.de):
After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-06 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466888#comment-15466888
 ] 

Christoph Goller edited comment on LUCENE-7398 at 9/6/16 9:36 AM:
--

After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is one. 

1) Missing Matches caused by lazy iteration:

I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I 
know there never was an implementation that compared all possible combinations 
of subspan matches for SpanNearQuery in Lucene. So SpanNEarQuery always missed 
some matches.

*) This ticket demonstrates missing matches for ordered SpanQuery. Documents 
that should match aer not matching. They are caused by subspans of 
SpanNearQuery having a variable match length. For these cases the lazy 
iteration implementation which tries to optimize the number of comparisons of 
subspan matches is not sufficient.

*) Tim tried these examples with unorderd SpanQuery and got the same bahavior. 
I think this is caused by a similar kind of lazy iteration in the unordered 
case.

*) In the unorderd case lazy iteration also causes problems if the subspans do 
not have variable-length matches. This is demonstrated in LUCENE-5331 and 
LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all 
clauses of the SpanNearQuery were SpanTermQueries, but some occured more than 
once. For PhraseQuery and MultiPhraseQuery and their implementation in 
SloppyPhraseScore this seems to be a known problem that has been solved by a 
special complex treatment of repetitions that I currently don't understand in 
detail.

My current opinion: We should give up lazy iteration for the unorderd and the 
ordered case to solve these problems. I think it can be done and the 
performance peanalty should not be too big. We already iterate over all 
positions of all subspans. So we already have done the expensive operation of 
reading them. Should some more comparisons of int-values (positions) really 
matter so much? At least fo the ordered case I am optimistic that I could 
implement it efficiently.

2) Inconsistent Scoring of SpanNearQuery

*) Lazy iteration means that some "redundant" matches in a document are skipped 
in order to have a faster matching algorithm. I am not sure how redundant was 
defined exactly for the idea of lazy iteration. It referred to matches with the 
same start posisiton somehow. As long as different matches for the first clause 
are concerned, they are found, but not the all matches for intermediate 
subclauses are regarded. Skipping matches however reduces the frequency that is 
computed and consequently the score. See Javadoc of phraseFreq() in 
SloppyPhraseScore which mention the same phenomenon. This is quite important 
for my use case of SpanQueries. I have different versions/variants of the same 
term on the same position, e.g. one with case-normalization and one without and 
I want a higher score if the user-query matches for more than one variant, and 
I use this approach for clauses of SpanNearQuery.

*) In NearSpansOrdered the method width() (it is used to compute sloppy 
frequency in SpanScore) returns the number of gaps between the matches. If you 
have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it 
returns the length of the match, not the number of gaps. See atMatch() for the 
difference. The reason is probably, that (maxEndPositionCell.endPosition() - 
minPositionCell().startPosition() - totalSpanLength) might even become negative 
if matches overlap. I would prefer something like Math.max(0, 
(maxEndPositionCell.endPosition() - minPositionCell().startPosition() - 
totalSpanLength))

*) SpanOrQuery and SpanNearQuery completely ignore the scores of their 
subclauses  (subweights are always generated as non-scoring). A SpanOrQuery 
should give a Score similar to a BooleanQuery, shouldn't it? As long as we have 
this behavior, SpanBoostQuery does not make any sense, doese it? So to my 
opinion the existance of SpanBoostQuery shows that others also had the idea 
that a nested SpanQuery should somehow use the scores of their clauses for the 
computation of their own score.



was (Author: gol...@detego-software.de):
After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-06 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466888#comment-15466888
 ] 

Christoph Goller edited comment on LUCENE-7398 at 9/6/16 9:14 AM:
--

After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is one. 

1) Missing Matches caused by lazy iteration:

I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I 
know there never was an implementation that compared all possible combinations 
of subspan matches for SpanNearQuery in Lucene. So SpanNEarQuery always missed 
some matches.

*) This ticket demonstrates missing matches for ordered SpanQuery. They are 
caused by subspans of SpanNearQuery having a variable match length. For these 
cases the lazy iteration implementation which tries to optimize the number of 
comparisons of subspan matches is not sufficient.

*) Tim tried these examples with unorderd SpanQuery and got the same bahavior. 
I think this is caused by a similar kind of lazy iteration in the unordered 
case.

*) In the unorderd case lazy iteration also causes problems if the subspans do 
not have variable-length matches. This is demonstrated in LUCENE-5331 and 
LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all 
clauses of the SpanNearQuery were SpanTermQueries, but some occured more than 
once. For PhraseQuery and MultiPhraseQuery and their implementation in 
SloppyPhraseScore this seems to be a known problem that has been solved by a 
special complex treatment of repetitions that I currently don't understand in 
detail.

My current opinion: We should give up lazy iteration for the unorderd and the 
ordered case to solve these problems. I think it can be done and the 
performance peanalty should not be too big. We already iterate over all 
positions of all subspans. So we already have done the expensive operation of 
reading them. Should some more comparisons of int-values (positions) really 
matter so much? At least fo the ordered case I am optimistic that I could 
implement it efficiently.

2) Inconsistent Scoring of SpanNearQuery

*) Lazy iteration means that some "redundant" matches in a document are skipped 
in order to have a faster matching algorithm. I am not sure how redundant was 
defined exactly for the idea of lazy iteration. It referred to matches with the 
same start posisiton somehow. As long as different matches for the first clause 
are concerned, they are found, but not the all matches for intermediate 
subclauses are regarded. Skipping matches however reduces the frequency that is 
computed and consequently the score. See Javadoc of phraseFreq() in 
SloppyPhraseScore which mention the same phenomenon. This is quite important 
for my use case of SpanQueries. I have different versions/variants of the same 
term on the same position, e.g. one with case-normalization and one without and 
I want a higher score if the user-query matches for more than one variant, and 
I use this approach for clauses of SpanNearQuery.

*) In NearSpansOrdered the method width() (it is used to compute sloppy 
frequency in SpanScore) returns the number of gaps between the matches. If you 
have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it 
returns the length of the match, not the number of gaps. See atMatch() for the 
difference. The reason is probably, that (maxEndPositionCell.endPosition() - 
minPositionCell().startPosition() - totalSpanLength) might even become negative 
if matches overlap. I would prefer something like Math.max(0, 
(maxEndPositionCell.endPosition() - minPositionCell().startPosition() - 
totalSpanLength))

*) SpanOrQuery and SpanNearQuery completely ignore the scores of their 
subclauses  (subweights are always generated as non-scoring). A SpanOrQuery 
should give a Score similar to a BooleanQuery, shouldn't it? As long as we have 
this behavior, SpanBoostQuery does not make any sense, doese it? So to my 
opinion the existance of SpanBoostQuery shows that others also had the idea 
that a nested SpanQuery should somehow use the scores of their clauses for the 
computation of their own score.



was (Author: gol...@detego-software.de):
After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-05 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465301#comment-15465301
 ] 

Christoph Goller edited comment on LUCENE-7398 at 9/5/16 4:07 PM:
--

Paul's  20160814 patch  almost convinced me. Unfortunately, it does not fix the 
case when an intermediate span has a longer match that reduces overall 
sloppyness but overlaps with a match of a subsequent span and consequently 
requires advancing the subsequent span. Here is an example 

Document: w1 w2 w3 w4 w5
near/0(w1, or(w2, near/0(w2, w3, w4)), or(w5, near/0(w4, w5)))

Add the following code to the end of TestSpanCollection.testNestedNearQuery()

{code}
SpanNearQuery q234 = new SpanNearQuery(new SpanQuery[]{q2, q3, q4}, 0, true);
SpanOrQuery q2234 = new SpanOrQuery(q2, q234);
SpanTermQuery p5 = new SpanTermQuery(new Term(FIELD, "w5"));
SpanNearQuery q45 = new SpanNearQuery(new SpanQuery[]{q4, p5}, 0, true);
SpanOrQuery q455 = new SpanOrQuery(q45, p5);

SpanNearQuery q1q2234q445 = new SpanNearQuery(new SpanQuery[]{q1, q2234, q455}, 
0, true);
spans = q1q2234q445.createWeight(searcher, false, 
1f).getSpans(searcher.getIndexReader().leaves().get(0),SpanWeight.Postings.POSITIONS);
assertEquals(0, spans.advance(0));
{code}

I think we can only fix it if we get give up lazy iteration. I don't think this 
is so bad for performance. If we implement a clever caching for positions in 
spans a complete backtracking would only consist of making a few additional 
int-comparisons. The expensive operation is iterating over all span positions 
(IO) and we do this already in advancePosition(Spans, int), aren't we. 


was (Author: gol...@detego-software.de):
Paul's fix almost convinced me. Unfortunately, it does not fix the case when an 
intermediate span has a longer match that reduces overall sloppyness but 
overlaps with a match of a subsequent span and consequently requires advancing 
the subsequent span. Here is an example 

Document: w1 w2 w3 w4 w5
near/0(w1, or(w2, near/0(w2, w3, w4)), or(w5, near/0(w4, w5)))

Add the following code to the end of TestSpanCollection.testNestedNearQuery()

{code}
SpanNearQuery q234 = new SpanNearQuery(new SpanQuery[]{q2, q3, q4}, 0, true);
SpanOrQuery q2234 = new SpanOrQuery(q2, q234);
SpanTermQuery p5 = new SpanTermQuery(new Term(FIELD, "w5"));
SpanNearQuery q45 = new SpanNearQuery(new SpanQuery[]{q4, p5}, 0, true);
SpanOrQuery q455 = new SpanOrQuery(q45, p5);

SpanNearQuery q1q2234q445 = new SpanNearQuery(new SpanQuery[]{q1, q2234, q455}, 
0, true);
spans = q1q2234q445.createWeight(searcher, false, 
1f).getSpans(searcher.getIndexReader().leaves().get(0),SpanWeight.Postings.POSITIONS);
assertEquals(0, spans.advance(0));
{code}

I think we can only fix it if we get give up lazy iteration. I don't think this 
is so bad for performance. If we implement a clever caching for positions in 
spans a complete backtracking would only consist of making a few additional 
int-comparisons. The expensive operation is iterating over all span positions 
(IO) and we do this already in advancePosition(Spans, int), aren't we. 

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-08-03 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903
 ] 

Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:21 PM:
--

After thoroughly looking into SpanQueries my conclusion is, that we have a 
fundamental problem in the implementation of SpanNearQuery. The problem is not 
new, it probably existed already in the first version of SpanQueries which as 
far as I know were implemented by Doug Cutting himself. I remember some 
attempts to describe in which cases SpanQueries work correctly and in which 
they do not (discussions about overlapping), but those explanations and 
definitions were never completely convincing for me. 

My best guess: NearSpansOrdered and NearSpansUnordered currently are only 
correct if for each clause of the SpanQuery we can guarantee, that all its 
matches have the same length. In this case it is clear that (for the ordered 
case) if a match is too long (sloppy) we can skip to the first clause and call 
nextPosition. No alternative matches of intermediate clauses could improve the 
overall match. If we have clauses with varying match length (SpanOr or SpanNear 
with sloppyness) we would have to backtrack to intermediate clauses and check 
whether there are e.g. longer matches that could reduce the overall match 
length. Pauls last test case shows that even a match of the second clause that 
advances its position can reduce the overall lenght if it is longer himself. A 
match of an intermediate clause at an advanced position could be considerably 
shorter than its first match requiring a reset of the spans of following 
clauses. To my opinion this bug can only be fixed by implementing a 
backtracking search on the subspans that also requires a limited possibilitxy 
to reposition Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4 
provided a kind of backtracking which was the reason why my queries worked in 
elasticsearch 1.7.x. However, I think the implementation also did not solve all 
cases:

{code}
  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
matchStart = subSpans[subSpans.length - 1].start();
matchEnd = subSpans[subSpans.length - 1].end();
Set possibleMatchPayloads = new HashSet<>();
if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
  possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
}

Collection possiblePayload = null;

int matchSlop = 0;
int lastStart = matchStart;
int lastEnd = matchEnd;
for (int i = subSpans.length - 2; i >= 0; i--) {
  Spans prevSpans = subSpans[i];
  if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
  }
  
  int prevStart = prevSpans.start();
  int prevEnd = prevSpans.end();
  while (true) { // Advance prevSpans until after (lastStart, lastEnd)
if (! prevSpans.next()) {
  inSameDoc = false;
  more = false;
  break; // Check remaining subSpans for final match.
} else if (matchDoc != prevSpans.doc()) {
  inSameDoc = false; // The last subSpans is not advanced here.
  break; // Check remaining subSpans for last match in this document.
} else {
  int ppStart = prevSpans.start();
  int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
  if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
break; // Check remaining subSpans.
  } else { // prevSpans still before (lastStart, lastEnd)
prevStart = ppStart;
prevEnd = ppEnd;
if (collectPayloads && prevSpans.isPayloadAvailable()) {
  Collection payload = prevSpans.getPayload();
  possiblePayload = new ArrayList<>(payload.size());
  possiblePayload.addAll(payload);
}
  }
}
  }

  if (collectPayloads && possiblePayload != null) {
possibleMatchPayloads.addAll(possiblePayload);
  }
  
  assert prevStart <= matchStart;
  if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
matchSlop += (matchStart - prevEnd);
  }

  /* Do not break on (matchSlop > allowedSlop) here to make sure
   * that subSpans[0] is advanced after the match, if any.
   */
  matchStart = prevStart;
  lastStart = prevStart;
  lastEnd = prevEnd;
}

boolean match = matchSlop 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-08-03 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903
 ] 

Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:20 PM:
--

After thoroughly looking into SpanQueries my conclusion is, that we have a 
fundamental problem in the implementation of SpanNearQuery. The problem is not 
new, it probably existed already in the first version of SpanQueries which as 
far as I know were implemented by Doug Cutting himself. I remember some 
attempts to describe in which cases SpanQueries work correctly and in which 
they do not (discussions about overlapping), but those explanations and 
definitions were never completely convincing for me. 

My best guess: NearSpansOrdered and NearSpansUnordered currently are only 
correct if for each clause of the SpanQuery we can guarantee, that all its 
matches have the same length. In this case it is clear that (for the ordered 
case) if a match is too long (sloppy) we can skip to the first clause and call 
nextPosition. No alternative matches of intermediate clauses could improve the 
overall match. If we have clauses with varying match length (SpanOr or SpanNear 
with sloppyness) we would have to backtrack to intermediate clauses and check 
whether there are e.g. longer matches that could reduce the overall match 
length. Pauls last test case shows that even a match of the second clause that 
advances its position can reduce the overall lenght if it is longer himself. A 
match of an intermediate clause at an advanced position could be considerably 
shorter than its first match requiring a reset of the spans of following 
clauses. To my opinion this bug can only be fixed by implementing a 
backtracking search on the subspans that also requires a limited possibilitxy 
to reposition Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4 
provided a kind of backtracking which was the reason why my queries worked in 
elasticsearch 1.7.x. However, I think the implementation also did not solve all 
cases:

{code}
  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
matchStart = subSpans[subSpans.length - 1].start();
matchEnd = subSpans[subSpans.length - 1].end();
Set possibleMatchPayloads = new HashSet<>();
if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
  possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
}

Collection possiblePayload = null;

int matchSlop = 0;
int lastStart = matchStart;
int lastEnd = matchEnd;
for (int i = subSpans.length - 2; i >= 0; i--) {
  Spans prevSpans = subSpans[i];
  if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
  }
  
  int prevStart = prevSpans.start();
  int prevEnd = prevSpans.end();
  while (true) { // Advance prevSpans until after (lastStart, lastEnd)
if (! prevSpans.next()) {
  inSameDoc = false;
  more = false;
  break; // Check remaining subSpans for final match.
} else if (matchDoc != prevSpans.doc()) {
  inSameDoc = false; // The last subSpans is not advanced here.
  break; // Check remaining subSpans for last match in this document.
} else {
  int ppStart = prevSpans.start();
  int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
  if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
break; // Check remaining subSpans.
  } else { // prevSpans still before (lastStart, lastEnd)
prevStart = ppStart;
prevEnd = ppEnd;
if (collectPayloads && prevSpans.isPayloadAvailable()) {
  Collection payload = prevSpans.getPayload();
  possiblePayload = new ArrayList<>(payload.size());
  possiblePayload.addAll(payload);
}
  }
}
  }

  if (collectPayloads && possiblePayload != null) {
possibleMatchPayloads.addAll(possiblePayload);
  }
  
  assert prevStart <= matchStart;
  if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
matchSlop += (matchStart - prevEnd);
  }

  /* Do not break on (matchSlop > allowedSlop) here to make sure
   * that subSpans[0] is advanced after the match, if any.
   */
  matchStart = prevStart;
  lastStart = prevStart;
  lastEnd = prevEnd;
}

boolean match = matchSlop 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-08-03 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903
 ] 

Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:19 PM:
--

After thoroughly looking into SpanQueries my conclusion is, that we have a 
fundamental problem in the implementation of SpanNearQuery. The problem is not 
new, it probably existed already in the first version of SpanQueries which as 
far as I know were implemented by Doug Cutting himself. I remember some 
attempts to describe in which cases SpanQueries work correctly and in which 
they do not (discussions about overlapping), but those explanations and 
definitions were never completely convincing for me. 

My best guess: NearSpansOrdered and NearSpansUnordered currently are only 
correct if for each clause of the SpanQuery we can guarantee, that all its 
matches have the same length. In this case it is clear that (for the ordered 
case) if a match is too long (sloppy) we can skip to the first clause and call 
nextPosition. No alternative matches of intermediate clauses could improve the 
overall match. If we have clauses with varying match length (SpanOr or SpanNear 
with sloppyness) we would have to backtrack to intermediate clauses and check 
whether there are e.g. longer matches that could reduce the overall match 
length. Pauls last test case shows that even a match of the second clause that 
advances its position can reduce the overall lenght if it is longer himself. A 
match of an intermediate clause at an advanced position could be considerably 
shorter than its first match requiring a reset of the spans of following 
clauses. To my opinion this bug can only be fixed by implementing a 
backtracking search on the subspans that also requires a limited possibilitxy 
to reposion Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4 
provided a kind of backtracking which was the reason why my queries worked in 
elasticsearch 1.7.x. However, I think the implementation also did not solve all 
cases:

{code}
  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
matchStart = subSpans[subSpans.length - 1].start();
matchEnd = subSpans[subSpans.length - 1].end();
Set possibleMatchPayloads = new HashSet<>();
if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
  possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
}

Collection possiblePayload = null;

int matchSlop = 0;
int lastStart = matchStart;
int lastEnd = matchEnd;
for (int i = subSpans.length - 2; i >= 0; i--) {
  Spans prevSpans = subSpans[i];
  if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
  }
  
  int prevStart = prevSpans.start();
  int prevEnd = prevSpans.end();
  while (true) { // Advance prevSpans until after (lastStart, lastEnd)
if (! prevSpans.next()) {
  inSameDoc = false;
  more = false;
  break; // Check remaining subSpans for final match.
} else if (matchDoc != prevSpans.doc()) {
  inSameDoc = false; // The last subSpans is not advanced here.
  break; // Check remaining subSpans for last match in this document.
} else {
  int ppStart = prevSpans.start();
  int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
  if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
break; // Check remaining subSpans.
  } else { // prevSpans still before (lastStart, lastEnd)
prevStart = ppStart;
prevEnd = ppEnd;
if (collectPayloads && prevSpans.isPayloadAvailable()) {
  Collection payload = prevSpans.getPayload();
  possiblePayload = new ArrayList<>(payload.size());
  possiblePayload.addAll(payload);
}
  }
}
  }

  if (collectPayloads && possiblePayload != null) {
possibleMatchPayloads.addAll(possiblePayload);
  }
  
  assert prevStart <= matchStart;
  if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
matchSlop += (matchStart - prevEnd);
  }

  /* Do not break on (matchSlop > allowedSlop) here to make sure
   * that subSpans[0] is advanced after the match, if any.
   */
  matchStart = prevStart;
  lastStart = prevStart;
  lastEnd = prevEnd;
}

boolean match = matchSlop 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-08-03 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903
 ] 

Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:18 PM:
--

After thoroughly looking into SpanQueries my conclusion is, that we have a 
fundamental problem in the implementation of SpanNearQuery. The problem is not 
new, it probably existed already in the first version of SpanQueries which as 
far as I know were implemented by Doug Cutting himself. I remember some 
attempts to describe in which cases SpanQueries work correctly and in which 
they do not (discussions about overlapping), but those explanations and 
definitions were never completely convincing for me. 

My best guess: NearSpansOrdered and NearSpansUnordered currently are only 
correct if for each clause of the SpanQuery we can guarantee, that all its 
matches have the same length. In this case it is clear that (for the ordered 
case) if a match is too long (sloppy) we can skip to the first clause and call 
nextPosition. No alternative matches of intermediate clauses could improve the 
overall match. If we have clauses with varying match length (SpanOr or SpanNear 
with sloppyness) we would have to backtrack to intermediate clauses and check 
whether there are e.g. longer matches that could reduce the overall match 
length. Pauls last test case shows that even a match of the second clause that 
advances its position can reduce the overall lenght if it is longer himnself. A 
match of an intermediate clause at an advanced position could be considerably 
shorter than its first match requiring a reset of the spans of following 
clauses. To my opinion this bug can only be fixed by implementing a 
backtracking search on the subspans that also requires a limited possibilitxy 
to reposion Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4 
provided a kind of backtracking which was the reason why my queries worked in 
elasticsearch 1.7.x. However, I think the implementation also did not solve all 
cases:

{code}
  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
matchStart = subSpans[subSpans.length - 1].start();
matchEnd = subSpans[subSpans.length - 1].end();
Set possibleMatchPayloads = new HashSet<>();
if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
  possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
}

Collection possiblePayload = null;

int matchSlop = 0;
int lastStart = matchStart;
int lastEnd = matchEnd;
for (int i = subSpans.length - 2; i >= 0; i--) {
  Spans prevSpans = subSpans[i];
  if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
  }
  
  int prevStart = prevSpans.start();
  int prevEnd = prevSpans.end();
  while (true) { // Advance prevSpans until after (lastStart, lastEnd)
if (! prevSpans.next()) {
  inSameDoc = false;
  more = false;
  break; // Check remaining subSpans for final match.
} else if (matchDoc != prevSpans.doc()) {
  inSameDoc = false; // The last subSpans is not advanced here.
  break; // Check remaining subSpans for last match in this document.
} else {
  int ppStart = prevSpans.start();
  int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
  if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
break; // Check remaining subSpans.
  } else { // prevSpans still before (lastStart, lastEnd)
prevStart = ppStart;
prevEnd = ppEnd;
if (collectPayloads && prevSpans.isPayloadAvailable()) {
  Collection payload = prevSpans.getPayload();
  possiblePayload = new ArrayList<>(payload.size());
  possiblePayload.addAll(payload);
}
  }
}
  }

  if (collectPayloads && possiblePayload != null) {
possibleMatchPayloads.addAll(possiblePayload);
  }
  
  assert prevStart <= matchStart;
  if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
matchSlop += (matchStart - prevEnd);
  }

  /* Do not break on (matchSlop > allowedSlop) here to make sure
   * that subSpans[0] is advanced after the match, if any.
   */
  matchStart = prevStart;
  lastStart = prevStart;
  lastEnd = prevEnd;
}

boolean match = matchSlop 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-08-02 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404699#comment-15404699
 ] 

Paul Elschot edited comment on LUCENE-7398 at 8/2/16 8:17 PM:
--

The patch changes the order in SpanPositionQueue which is used by SpanOr.
This works to pass the test case, and it does not increase complexity.

I think the problem is in NearSpansOrdered.stretchToOrder() which only does 
this:
{code}matchEnd = subSpans[subSpans.length - 1].endPosition();{code}
What is should also do is lookahead to see whether there is an ordered match 
with a smaller slop.

It could be that there still is a failing case with a nested SpanOr, possibly 
containing another nested SpanNear, but I'm not sure, this is tricky.

Since looking ahead increases the complexity (normal case runtime) I'd prefer 
to have the patch applied now, and see what the future brings.



was (Author: paul.elsc...@xs4all.nl):
The patch changes the order in SpanPositionQueue which is used by SpanOr.
This works to pass the test case, and it does not increase complexity.

I think the problem is in NearSpansOrdered.stretchToOrder() which only does 
this:
{code}matchEnd = subSpans[subSpans.length - 1].endPosition();{code}
What is should also do is lookahead to see whether there is an ordered match 
with a smaller slop.

It could be that there still is a failing case with a nested SpanOr, possibly 
over containing another nested SpanNear, but I'm not sure, this is tricky.

Since looking ahead increases the complexity (normal case runtime) I'd prefer 
to have the patch applied now, and see what the future brings.


> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-07-29 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399414#comment-15399414
 ] 

Christoph Goller edited comment on LUCENE-7398 at 7/29/16 2:40 PM:
---

Please find attatched an extended TestSpanCollection.java for Lucene 6.1 that 
shows the problem.


was (Author: gol...@detego-software.de):
Please find attatched an extended TestSpanCollection.java that shows the 
problem.

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Priority: Critical
> Attachments: TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org