[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630529#comment-16630529 ] Michael Gibney edited comment on LUCENE-7398 at 9/27/18 3:01 PM: - I have a branch containing a candidate fix for this issue: [LUCENE-7398/master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master] It includes support for complete graph-based matching, configurable to include: # all valid top-level {{startPosition}} s # all valid match lengths (in the {{startPosition - endPosition}} sense) # all valid match {{width}} s (in the slop sense) # all redundant matches (different {{Term}} s, same {{startPosition}}, {{endPosition}}, and {{width}}) # all possible valid combinations of subclause positions Option 1 is appropriate for top-level matching and document matching (and is complete for that use case); options 2/3 may be used in subclauses to guarantee complete matching of parent {{Spans}}; option 4 results in very thorough scoring. Option 5 would be an unusual use case; but I think there are some applications for full combinatoric matching, and the option was well supported by the implementation, so it is included for the sake of completeness. The candidate implementation models the match graph as a kind of 2-dimensional queue that supports random-access seek and arbitrary node removal. A more thorough explanation would be unwieldy in a comment, so I wrote [three posts|https://michaelgibney.net/lucene/graph/], which respectively: # [Provide some background|https://michaelgibney.net/2018/09/lucene-graph-queries-1/] on the problem associated with LUCENE-7398 (this post is heavily informed by the discussion on this issue) # [Describe the candidate implementation|https://michaelgibney.net/2018/09/lucene-graph-queries-2/] in some detail (also includes information on how to configure/test/evaluate) # [Anticipate some possible consequences/applications|https://michaelgibney.net/2018/09/lucene-graph-queries-3/] of new functionality that would be enabled by this (or other equivalent) fix Some notes: # The branch contains (and passes) all tests proposed so far in association with this issue (and also quite a few additional tests) # The candidate implementation is made more complete and performant by the addition of some extra information in the index (e.g., {{positionLength}}). This extra information is currently stored using {{Payload}} s, though for {{positionLength}} at least, there has been some discussion of integrating it more directly in the index (see LUCENE-4312, LUCENE-3843) # Some version of this code has been running in production for several months, and has given no indication of instability, even running every user phrase query (both explicit and {{pf}}) as a graph query. # To facilitate evaluation, the fix is integrated in [master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master], [branch_7x|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7x], [branch_7_5|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_5], and [branch_7_4|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/branch_7_4]. was (Author: mgibney): I have a branch containing a candidate fix for this issue: [LUCENE-7398/master|https://github.com/magibney/lucene-solr/tree/LUCENE-7398/master] It includes support for complete graph-based matching, configurable to include: # all valid top-level {{startPosition}} s # all valid match lengths (in the {{startPosition - endPosition}} sense) # all valid match {{width}} s (in the slop sense) # all redundant matches (different {{Term}} s, same {{startPosition}}, {{endPosition}}, and {{width}}) # all possible valid combinations of subclause positions Option 1 is appropriate for top-level matching and document matching (and is complete for that use case); options 2/3 may be used in subclauses to guarantee complete matching of parent {{Spans}}; option 4 results in very thorough scoring. Option 5 would be an unusual use case; but I think there are some applications for full combinatoric matching, and the option was well supported by the implementation, so it is included for the sake of completeness. The candidate implementation models the match graph as a kind of 2-dimensional queue that supports random-access seek and arbitrary node removal. A more thorough explanation would be unwieldy in a comment, so I wrote [three posts|https://michaelgibney.net/lucene/graph/], which respectively: # [Provide some background|https://michaelgibney.net/2018/09/lucene-graph-queries-1/] on the problem associated with LUCENE-7398 (this post is heavily informed by the discussion on this issue) # [Describe the candidate implementation|https://michaelgibney.net/2018/09/lucene-graph-queries-2/] in some detail (also includes information on how to configure/test/evaluate) #
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892972#comment-15892972 ] Paul Elschot edited comment on LUCENE-7398 at 3/2/17 8:54 PM: -- One way to view the problem is that when span end positions are used to determine the slop, it becomes impossible to determine an order for moving the subspans to a next position. So one direction out of this could be: use NearSpans that determines the slop only by the start positions of the subspans. That leaves only the cases in which the subspans can start (and maybe also end) at the same position. To make sure that all the subspans move forward after a match we could move them all forward until after the current match, and while doing that also count/collect them for scoring/highlighting as long as they are within the match. That should solve the bug reported here, which is about scoring a missed matching occurrence. This limits the required slop to using only the starting positions of the subspans. Could this work? was (Author: paul.elsc...@xs4all.nl): On way to view the problem is that when span end positions are used to determine the slop, it becomes impossible to determine an order for moving the subspans to a next position. So one direction out of this could be: use NearSpans that determines the slop only by the start positions of the subspans. That leaves only the cases in which the subspans can start (and maybe also end) at the same position. To make sure that all the subspans move forward after a match we could move them all forward until after the current match, and while doing that also count/collect them for scoring/highlighting as long as they are within the match. That should solve the bug reported here, which is about scoring a missed matching occurrence. This limits the required slop to using only the starting positions of the subspans. Could this work? > Nested Span Queries are buggy > - > > Key: LUCENE-7398 > URL: https://issues.apache.org/jira/browse/LUCENE-7398 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 5.5, 6.x >Reporter: Christoph Goller >Assignee: Alan Woodward >Priority: Critical > Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch, > LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch, > LUCENE-7398.patch, TestSpanCollection.java > > > Example for a nested SpanQuery that is not working: > Document: Human Genome Organization , HUGO , is trying to coordinate gene > mapping research worldwide. > Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], > 0, true), body:gene]), body:research], 0, true) > The query should match "coordinate gene mapping research" as well as > "coordinate gene research". It does not match "coordinate gene mapping > research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It > probably stopped working with the changes on SpanQueries in 5.3. I will > attach a unit test that shows the problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519320#comment-15519320 ] Paul Elschot edited comment on LUCENE-7398 at 9/25/16 10:00 PM: Patch of 24 Sep 2016, work in progress. Edit: superseded on 25 Sep, this can be ignored. This introduces SpanNearQuery.MatchNear to choose the matching method. The ORDERED_LAZY case is still the patch of 14 August, this should be changed back to the current implementation, and be used to implement ORDERED_LOOKAHEAD. This implements MatchNear.UNORDERED_STARTPOS and uses that as the default implementation for the unordered case. The implementation of UNORDERED_STARTPOS is in NearSpansUnorderedStartPos, which is simpler than the current NearSpansUnordered, there is no SpansCell. I'd expect this StartPos implementation to be a little faster, so I also implemented it as default for the unordered case. In only one test case the UNORDERED_LAZY method is needed to pass the test. The question is whether it is ok to change the default unordered implementation to only use the span start positions. The collect() method is moved to the superclass ConjunctionSpans, this simplification might be done at another issue. was (Author: paul.elsc...@xs4all.nl): Patch of 24 Sep 2016, work in progress. This introduces SpanNearQuery.MatchNear to choose the matching method. The ORDERED_LAZY case is still the patch of 14 August, this should be changed back to the current implementation, and be used to implement ORDERED_LOOKAHEAD. This implements MatchNear.UNORDERED_STARTPOS and uses that as the default implementation for the unordered case. The implementation of UNORDERED_STARTPOS is in NearSpansUnorderedStartPos, which is simpler than the current NearSpansUnordered, there is no SpansCell. I'd expect this StartPos implementation to be a little faster, so I also implemented it as default for the unordered case. In only one test case the UNORDERED_LAZY method is needed to pass the test. The question is whether it is ok to change the default unordered implementation to only use the span start positions. The collect() method is moved to the superclass ConjunctionSpans, this simplification might be done at another issue. > Nested Span Queries are buggy > - > > Key: LUCENE-7398 > URL: https://issues.apache.org/jira/browse/LUCENE-7398 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 5.5, 6.x >Reporter: Christoph Goller >Assignee: Alan Woodward >Priority: Critical > Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch, > LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch, > TestSpanCollection.java > > > Example for a nested SpanQuery that is not working: > Document: Human Genome Organization , HUGO , is trying to coordinate gene > mapping research worldwide. > Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], > 0, true), body:gene]), body:research], 0, true) > The query should match "coordinate gene mapping research" as well as > "coordinate gene research". It does not match "coordinate gene mapping > research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It > probably stopped working with the changes on SpanQueries in 5.3. I will > attach a unit test that shows the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504686#comment-15504686 ] Paul Elschot edited comment on LUCENE-7398 at 9/19/16 9:22 PM: --- The idea is to allow full backward compatibility, as well as more matching methods: UNORDERED_LAZY is the current unordered, UNORDERED_STARTPOS is even simpler, it only uses span start positions, so it should be complete. ORDERED_LAZY is the current ordered, ORDERED_LOOKAHEAD is in the patch of 14 August 2016, ORDERED_STARTPOS also only uses start positions, so it should be complete. The complete ORDERED and UNORDERED cases that use start and end positions and need backtracking are left for later. Comments? was (Author: paul.elsc...@xs4all.nl): The idea is to allow full backward compatibility, as well as more matching methods: UNORDERED_LAZY is the current unordered, UNORDERED_STARTPOS is even simpler, it only uses span start positions, so it should be complete. ORDERED_LAZY is the current ordered, ORDERED_LOOKAHEAD is in the patch of 14 August 2016, ORDERED_STARTPOS is also only uses start positions, so it should be complete. The complete ORDERED and UNORDERED cases that use start and end positions and need backtracking are left for later. Comments? > Nested Span Queries are buggy > - > > Key: LUCENE-7398 > URL: https://issues.apache.org/jira/browse/LUCENE-7398 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 5.5, 6.x >Reporter: Christoph Goller >Assignee: Alan Woodward >Priority: Critical > Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, > LUCENE-7398.patch, TestSpanCollection.java > > > Example for a nested SpanQuery that is not working: > Document: Human Genome Organization , HUGO , is trying to coordinate gene > mapping research worldwide. > Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], > 0, true), body:gene]), body:research], 0, true) > The query should match "coordinate gene mapping research" as well as > "coordinate gene research". It does not match "coordinate gene mapping > research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It > probably stopped working with the changes on SpanQueries in 5.3. I will > attach a unit test that shows the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504663#comment-15504663 ] Paul Elschot edited comment on LUCENE-7398 at 9/19/16 8:59 PM: --- I have started on working on a SpanNearQuery that contains this: {code} /** Specifies how clauses are to occur near each other in matching documents. */ public static enum MatchNear { /** Use this method for clauses that match when they are not ordered, * and the slop should be determined between the end and start positions of all clauses. * When the subspans vary in length, some matches may not be found. */ UNORDERED_LAZY, /** Use this method for clauses that match when they are not ordered, * and the slop should be determined between the start positions of the first and last matching clauses. */ UNORDERED_STARTPOS, /** Use this method for clauses that can match when they are ordered and span collection is needed, * and the slop should be determined between the end and start positions of the clauses. * When the subspans vary in length, some matches may not be found. */ ORDERED_LAZY, /** Use this method for clauses that can match when they are ordered and span collection is needed, * and the slop should be determined between the end and start positions of the clauses. * When the subspans vary in length, some matches may not be found, * however this method finds more matches than {@link ORDERED_LAZY}. */ ORDERED_LOOKAHEAD, /** Use this method for clauses that match when they are ordered, * and the slop should be determined between the start positions of the first and last matching clauses. */ ORDERED_STARTPOS } {code} was (Author: paul.elsc...@xs4all.nl): I have started on working on a SpanNearQuery that contains this: {code} /** Specifies how clauses are to occur near each other in matching documents. */ public static enum MatchNear { /** Use this method for clauses that match when they are not ordered, * and the slop should be determined between the end and start positions of all clauses. * When the subspans vary in length, some matches may not be found. */ UNORDERED, /** Use this method for clauses that match when they are not ordered, * and the slop should be determined between the start positions of the first and last matching clauses. */ UNORDERED_STARTPOS, /** Use this method for clauses that can match when they are ordered and span collection is needed, * and the slop should be determined between the end and start positions of the clauses. * When the subspans vary in length, some matches may not be found. */ ORDERED_LAZY, /** Use this method for clauses that can match when they are ordered and span collection is needed, * and the slop should be determined between the end and start positions of the clauses. * When the subspans vary in length, some matches may not be found, * however this method finds more matches than {@link ORDERED_LAZY}. */ ORDERED_LOOKAHEAD, /** Use this method for clauses that match when they are ordered, * and the slop should be determined between the start positions of the first and last matching clauses. */ ORDERED_STARTPOS } {code} > Nested Span Queries are buggy > - > > Key: LUCENE-7398 > URL: https://issues.apache.org/jira/browse/LUCENE-7398 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 5.5, 6.x >Reporter: Christoph Goller >Assignee: Alan Woodward >Priority: Critical > Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, > LUCENE-7398.patch, TestSpanCollection.java > > > Example for a nested SpanQuery that is not working: > Document: Human Genome Organization , HUGO , is trying to coordinate gene > mapping research worldwide. > Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], > 0, true), body:gene]), body:research], 0, true) > The query should match "coordinate gene mapping research" as well as > "coordinate gene research". It does not match "coordinate gene mapping > research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It > probably stopped working with the changes on SpanQueries in 5.3. I will > attach a unit test that shows the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466888#comment-15466888 ] Christoph Goller edited comment on LUCENE-7398 at 9/6/16 3:37 PM: -- After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to have a consensus about the desired bahavior of SpanQuery. This ticket may not be the right place for such a discussion, so please point me to a better place if there is one. 1) Missing Matches caused by lazy iteration: I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I know there never was an implementation that compared all possible combinations of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed some matches. *) This ticket demonstrates missing matches for ordered SpanQuery. Documents that should match don't match. This is caused by subspans of SpanNearQuery having a variable match length. For these cases the lazy iteration implementation which tries to optimize the number of comparisons of subspan matches is not sufficient. *) Tim tried these examples with unorderd SpanQuery and got the same bahavior. I think this is caused by a similar kind of lazy iteration in the unordered case. *) In the unordered case lazy iteration also causes problems if the subspans do not have variable-length matches. This is demonstrated in LUCENE-5331 and LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all clauses of the SpanNearQuery were SpanTermQueries, but some occured more than once. For PhraseQuery and MultiPhraseQuery and their implementation in SloppyPhraseScore this seems to be a known problem that has been solved by a special complex treatment of repetitions that I currently don't understand in detail. My current opinion: We should give up lazy iteration for the unordered and the ordered case to solve these problems. I think it can be done and the performance peanalty should not be too big. We already iterate over all positions of all subspans. So we already have done the expensive operation of reading them. Should some more comparisons of int-values (positions) really matter so much? At least for the ordered case I am optimistic that I could implement it efficiently. 2) Inconsistent Scoring of SpanNearQuery *) Lazy iteration means that some "redundant" matches in a document are skipped in order to have a faster matching algorithm. I am not sure how redundant was defined exactly for the idea of lazy iteration. It referred to matches with the same start posisiton somehow. As long as different matches for the first clause are concerned, they are found, but not the all matches for intermediate subclauses are regarded. Skipping matches however reduces the frequency that is computed and consequently the score. See Javadoc of phraseFreq() in SloppyPhraseScore which mention the same phenomenon. This is quite important for my use case of SpanQueries. I have different versions/variants of the same term on the same position, e.g. one with case-normalization and one without and I want a higher score if the user-query matches for more than one variant, and I use this approach for clauses of SpanNearQuery. *) In NearSpansOrdered the method width() (it is used to compute sloppy frequency in SpanScore) returns the number of gaps between the matches. If you have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it returns the length of the match, not the number of gaps. See atMatch() for the difference. The reason is probably, that (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength) might even become negative if matches overlap. I would prefer something like Math.max(0, (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength)) *) SpanOrQuery and SpanNearQuery completely ignore the scores of their subclauses (subweights are always generated as non-scoring). A SpanOrQuery should give a Score similar to a BooleanQuery, shouldn't it? As long as we have this behavior, SpanBoostQuery does not make any sense, doese it? So to my opinion the existance of SpanBoostQuery shows that others also had the idea that a nested SpanQuery should somehow use the scores of their clauses for the computation of their own score. was (Author: gol...@detego-software.de): After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to have a consensus about the desired bahavior of SpanQuery. This ticket may not be the right place for such a discussion, so
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466888#comment-15466888 ] Christoph Goller edited comment on LUCENE-7398 at 9/6/16 9:36 AM: -- After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to have a consensus about the desired bahavior of SpanQuery. This ticket may not be the right place for such a discussion, so please point me to a better place if there is one. 1) Missing Matches caused by lazy iteration: I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I know there never was an implementation that compared all possible combinations of subspan matches for SpanNearQuery in Lucene. So SpanNEarQuery always missed some matches. *) This ticket demonstrates missing matches for ordered SpanQuery. Documents that should match aer not matching. They are caused by subspans of SpanNearQuery having a variable match length. For these cases the lazy iteration implementation which tries to optimize the number of comparisons of subspan matches is not sufficient. *) Tim tried these examples with unorderd SpanQuery and got the same bahavior. I think this is caused by a similar kind of lazy iteration in the unordered case. *) In the unorderd case lazy iteration also causes problems if the subspans do not have variable-length matches. This is demonstrated in LUCENE-5331 and LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all clauses of the SpanNearQuery were SpanTermQueries, but some occured more than once. For PhraseQuery and MultiPhraseQuery and their implementation in SloppyPhraseScore this seems to be a known problem that has been solved by a special complex treatment of repetitions that I currently don't understand in detail. My current opinion: We should give up lazy iteration for the unorderd and the ordered case to solve these problems. I think it can be done and the performance peanalty should not be too big. We already iterate over all positions of all subspans. So we already have done the expensive operation of reading them. Should some more comparisons of int-values (positions) really matter so much? At least fo the ordered case I am optimistic that I could implement it efficiently. 2) Inconsistent Scoring of SpanNearQuery *) Lazy iteration means that some "redundant" matches in a document are skipped in order to have a faster matching algorithm. I am not sure how redundant was defined exactly for the idea of lazy iteration. It referred to matches with the same start posisiton somehow. As long as different matches for the first clause are concerned, they are found, but not the all matches for intermediate subclauses are regarded. Skipping matches however reduces the frequency that is computed and consequently the score. See Javadoc of phraseFreq() in SloppyPhraseScore which mention the same phenomenon. This is quite important for my use case of SpanQueries. I have different versions/variants of the same term on the same position, e.g. one with case-normalization and one without and I want a higher score if the user-query matches for more than one variant, and I use this approach for clauses of SpanNearQuery. *) In NearSpansOrdered the method width() (it is used to compute sloppy frequency in SpanScore) returns the number of gaps between the matches. If you have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it returns the length of the match, not the number of gaps. See atMatch() for the difference. The reason is probably, that (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength) might even become negative if matches overlap. I would prefer something like Math.max(0, (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength)) *) SpanOrQuery and SpanNearQuery completely ignore the scores of their subclauses (subweights are always generated as non-scoring). A SpanOrQuery should give a Score similar to a BooleanQuery, shouldn't it? As long as we have this behavior, SpanBoostQuery does not make any sense, doese it? So to my opinion the existance of SpanBoostQuery shows that others also had the idea that a nested SpanQuery should somehow use the scores of their clauses for the computation of their own score. was (Author: gol...@detego-software.de): After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to have a consensus about the desired bahavior of SpanQuery. This ticket may not be the right place for such a discussion, so
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466888#comment-15466888 ] Christoph Goller edited comment on LUCENE-7398 at 9/6/16 9:14 AM: -- After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to have a consensus about the desired bahavior of SpanQuery. This ticket may not be the right place for such a discussion, so please point me to a better place if there is one. 1) Missing Matches caused by lazy iteration: I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I know there never was an implementation that compared all possible combinations of subspan matches for SpanNearQuery in Lucene. So SpanNEarQuery always missed some matches. *) This ticket demonstrates missing matches for ordered SpanQuery. They are caused by subspans of SpanNearQuery having a variable match length. For these cases the lazy iteration implementation which tries to optimize the number of comparisons of subspan matches is not sufficient. *) Tim tried these examples with unorderd SpanQuery and got the same bahavior. I think this is caused by a similar kind of lazy iteration in the unordered case. *) In the unorderd case lazy iteration also causes problems if the subspans do not have variable-length matches. This is demonstrated in LUCENE-5331 and LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all clauses of the SpanNearQuery were SpanTermQueries, but some occured more than once. For PhraseQuery and MultiPhraseQuery and their implementation in SloppyPhraseScore this seems to be a known problem that has been solved by a special complex treatment of repetitions that I currently don't understand in detail. My current opinion: We should give up lazy iteration for the unorderd and the ordered case to solve these problems. I think it can be done and the performance peanalty should not be too big. We already iterate over all positions of all subspans. So we already have done the expensive operation of reading them. Should some more comparisons of int-values (positions) really matter so much? At least fo the ordered case I am optimistic that I could implement it efficiently. 2) Inconsistent Scoring of SpanNearQuery *) Lazy iteration means that some "redundant" matches in a document are skipped in order to have a faster matching algorithm. I am not sure how redundant was defined exactly for the idea of lazy iteration. It referred to matches with the same start posisiton somehow. As long as different matches for the first clause are concerned, they are found, but not the all matches for intermediate subclauses are regarded. Skipping matches however reduces the frequency that is computed and consequently the score. See Javadoc of phraseFreq() in SloppyPhraseScore which mention the same phenomenon. This is quite important for my use case of SpanQueries. I have different versions/variants of the same term on the same position, e.g. one with case-normalization and one without and I want a higher score if the user-query matches for more than one variant, and I use this approach for clauses of SpanNearQuery. *) In NearSpansOrdered the method width() (it is used to compute sloppy frequency in SpanScore) returns the number of gaps between the matches. If you have a perfect match it returns 0 (no sloppyness). In NearSpansUnordered it returns the length of the match, not the number of gaps. See atMatch() for the difference. The reason is probably, that (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength) might even become negative if matches overlap. I would prefer something like Math.max(0, (maxEndPositionCell.endPosition() - minPositionCell().startPosition() - totalSpanLength)) *) SpanOrQuery and SpanNearQuery completely ignore the scores of their subclauses (subweights are always generated as non-scoring). A SpanOrQuery should give a Score similar to a BooleanQuery, shouldn't it? As long as we have this behavior, SpanBoostQuery does not make any sense, doese it? So to my opinion the existance of SpanBoostQuery shows that others also had the idea that a nested SpanQuery should somehow use the scores of their clauses for the computation of their own score. was (Author: gol...@detego-software.de): After thoroughly reviewing the current implementations of SpanNearQuery, PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I volunteer to fix at least some of these problems, but first I would like to have a consensus about the desired bahavior of SpanQuery. This ticket may not be the right place for such a discussion, so please point me to a better place if there is
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465301#comment-15465301 ] Christoph Goller edited comment on LUCENE-7398 at 9/5/16 4:07 PM: -- Paul's 20160814 patch almost convinced me. Unfortunately, it does not fix the case when an intermediate span has a longer match that reduces overall sloppyness but overlaps with a match of a subsequent span and consequently requires advancing the subsequent span. Here is an example Document: w1 w2 w3 w4 w5 near/0(w1, or(w2, near/0(w2, w3, w4)), or(w5, near/0(w4, w5))) Add the following code to the end of TestSpanCollection.testNestedNearQuery() {code} SpanNearQuery q234 = new SpanNearQuery(new SpanQuery[]{q2, q3, q4}, 0, true); SpanOrQuery q2234 = new SpanOrQuery(q2, q234); SpanTermQuery p5 = new SpanTermQuery(new Term(FIELD, "w5")); SpanNearQuery q45 = new SpanNearQuery(new SpanQuery[]{q4, p5}, 0, true); SpanOrQuery q455 = new SpanOrQuery(q45, p5); SpanNearQuery q1q2234q445 = new SpanNearQuery(new SpanQuery[]{q1, q2234, q455}, 0, true); spans = q1q2234q445.createWeight(searcher, false, 1f).getSpans(searcher.getIndexReader().leaves().get(0),SpanWeight.Postings.POSITIONS); assertEquals(0, spans.advance(0)); {code} I think we can only fix it if we get give up lazy iteration. I don't think this is so bad for performance. If we implement a clever caching for positions in spans a complete backtracking would only consist of making a few additional int-comparisons. The expensive operation is iterating over all span positions (IO) and we do this already in advancePosition(Spans, int), aren't we. was (Author: gol...@detego-software.de): Paul's fix almost convinced me. Unfortunately, it does not fix the case when an intermediate span has a longer match that reduces overall sloppyness but overlaps with a match of a subsequent span and consequently requires advancing the subsequent span. Here is an example Document: w1 w2 w3 w4 w5 near/0(w1, or(w2, near/0(w2, w3, w4)), or(w5, near/0(w4, w5))) Add the following code to the end of TestSpanCollection.testNestedNearQuery() {code} SpanNearQuery q234 = new SpanNearQuery(new SpanQuery[]{q2, q3, q4}, 0, true); SpanOrQuery q2234 = new SpanOrQuery(q2, q234); SpanTermQuery p5 = new SpanTermQuery(new Term(FIELD, "w5")); SpanNearQuery q45 = new SpanNearQuery(new SpanQuery[]{q4, p5}, 0, true); SpanOrQuery q455 = new SpanOrQuery(q45, p5); SpanNearQuery q1q2234q445 = new SpanNearQuery(new SpanQuery[]{q1, q2234, q455}, 0, true); spans = q1q2234q445.createWeight(searcher, false, 1f).getSpans(searcher.getIndexReader().leaves().get(0),SpanWeight.Postings.POSITIONS); assertEquals(0, spans.advance(0)); {code} I think we can only fix it if we get give up lazy iteration. I don't think this is so bad for performance. If we implement a clever caching for positions in spans a complete backtracking would only consist of making a few additional int-comparisons. The expensive operation is iterating over all span positions (IO) and we do this already in advancePosition(Spans, int), aren't we. > Nested Span Queries are buggy > - > > Key: LUCENE-7398 > URL: https://issues.apache.org/jira/browse/LUCENE-7398 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 5.5, 6.x >Reporter: Christoph Goller >Assignee: Alan Woodward >Priority: Critical > Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, > LUCENE-7398.patch, TestSpanCollection.java > > > Example for a nested SpanQuery that is not working: > Document: Human Genome Organization , HUGO , is trying to coordinate gene > mapping research worldwide. > Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], > 0, true), body:gene]), body:research], 0, true) > The query should match "coordinate gene mapping research" as well as > "coordinate gene research". It does not match "coordinate gene mapping > research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It > probably stopped working with the changes on SpanQueries in 5.3. I will > attach a unit test that shows the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903 ] Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:21 PM: -- After thoroughly looking into SpanQueries my conclusion is, that we have a fundamental problem in the implementation of SpanNearQuery. The problem is not new, it probably existed already in the first version of SpanQueries which as far as I know were implemented by Doug Cutting himself. I remember some attempts to describe in which cases SpanQueries work correctly and in which they do not (discussions about overlapping), but those explanations and definitions were never completely convincing for me. My best guess: NearSpansOrdered and NearSpansUnordered currently are only correct if for each clause of the SpanQuery we can guarantee, that all its matches have the same length. In this case it is clear that (for the ordered case) if a match is too long (sloppy) we can skip to the first clause and call nextPosition. No alternative matches of intermediate clauses could improve the overall match. If we have clauses with varying match length (SpanOr or SpanNear with sloppyness) we would have to backtrack to intermediate clauses and check whether there are e.g. longer matches that could reduce the overall match length. Pauls last test case shows that even a match of the second clause that advances its position can reduce the overall lenght if it is longer himself. A match of an intermediate clause at an advanced position could be considerably shorter than its first match requiring a reset of the spans of following clauses. To my opinion this bug can only be fixed by implementing a backtracking search on the subspans that also requires a limited possibilitxy to reposition Spans to previous positions. By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4 provided a kind of backtracking which was the reason why my queries worked in elasticsearch 1.7.x. However, I think the implementation also did not solve all cases: {code} /** The subSpans are ordered in the same doc, so there is a possible match. * Compute the slop while making the match as short as possible by advancing * all subSpans except the last one in reverse order. */ private boolean shrinkToAfterShortestMatch() throws IOException { matchStart = subSpans[subSpans.length - 1].start(); matchEnd = subSpans[subSpans.length - 1].end(); SetpossibleMatchPayloads = new HashSet<>(); if (subSpans[subSpans.length - 1].isPayloadAvailable()) { possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload()); } Collection possiblePayload = null; int matchSlop = 0; int lastStart = matchStart; int lastEnd = matchEnd; for (int i = subSpans.length - 2; i >= 0; i--) { Spans prevSpans = subSpans[i]; if (collectPayloads && prevSpans.isPayloadAvailable()) { Collection payload = prevSpans.getPayload(); possiblePayload = new ArrayList<>(payload.size()); possiblePayload.addAll(payload); } int prevStart = prevSpans.start(); int prevEnd = prevSpans.end(); while (true) { // Advance prevSpans until after (lastStart, lastEnd) if (! prevSpans.next()) { inSameDoc = false; more = false; break; // Check remaining subSpans for final match. } else if (matchDoc != prevSpans.doc()) { inSameDoc = false; // The last subSpans is not advanced here. break; // Check remaining subSpans for last match in this document. } else { int ppStart = prevSpans.start(); int ppEnd = prevSpans.end(); // Cannot avoid invoking .end() if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) { break; // Check remaining subSpans. } else { // prevSpans still before (lastStart, lastEnd) prevStart = ppStart; prevEnd = ppEnd; if (collectPayloads && prevSpans.isPayloadAvailable()) { Collection payload = prevSpans.getPayload(); possiblePayload = new ArrayList<>(payload.size()); possiblePayload.addAll(payload); } } } } if (collectPayloads && possiblePayload != null) { possibleMatchPayloads.addAll(possiblePayload); } assert prevStart <= matchStart; if (matchStart > prevEnd) { // Only non overlapping spans add to slop. matchSlop += (matchStart - prevEnd); } /* Do not break on (matchSlop > allowedSlop) here to make sure * that subSpans[0] is advanced after the match, if any. */ matchStart = prevStart; lastStart = prevStart; lastEnd = prevEnd; } boolean match = matchSlop
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903 ] Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:20 PM: -- After thoroughly looking into SpanQueries my conclusion is, that we have a fundamental problem in the implementation of SpanNearQuery. The problem is not new, it probably existed already in the first version of SpanQueries which as far as I know were implemented by Doug Cutting himself. I remember some attempts to describe in which cases SpanQueries work correctly and in which they do not (discussions about overlapping), but those explanations and definitions were never completely convincing for me. My best guess: NearSpansOrdered and NearSpansUnordered currently are only correct if for each clause of the SpanQuery we can guarantee, that all its matches have the same length. In this case it is clear that (for the ordered case) if a match is too long (sloppy) we can skip to the first clause and call nextPosition. No alternative matches of intermediate clauses could improve the overall match. If we have clauses with varying match length (SpanOr or SpanNear with sloppyness) we would have to backtrack to intermediate clauses and check whether there are e.g. longer matches that could reduce the overall match length. Pauls last test case shows that even a match of the second clause that advances its position can reduce the overall lenght if it is longer himself. A match of an intermediate clause at an advanced position could be considerably shorter than its first match requiring a reset of the spans of following clauses. To my opinion this bug can only be fixed by implementing a backtracking search on the subspans that also requires a limited possibilitxy to reposition Spans to previous positions. By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4 provided a kind of backtracking which was the reason why my queries worked in elasticsearch 1.7.x. However, I think the implementation also did not solve all cases: {code} /** The subSpans are ordered in the same doc, so there is a possible match. * Compute the slop while making the match as short as possible by advancing * all subSpans except the last one in reverse order. */ private boolean shrinkToAfterShortestMatch() throws IOException { matchStart = subSpans[subSpans.length - 1].start(); matchEnd = subSpans[subSpans.length - 1].end(); SetpossibleMatchPayloads = new HashSet<>(); if (subSpans[subSpans.length - 1].isPayloadAvailable()) { possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload()); } Collection possiblePayload = null; int matchSlop = 0; int lastStart = matchStart; int lastEnd = matchEnd; for (int i = subSpans.length - 2; i >= 0; i--) { Spans prevSpans = subSpans[i]; if (collectPayloads && prevSpans.isPayloadAvailable()) { Collection payload = prevSpans.getPayload(); possiblePayload = new ArrayList<>(payload.size()); possiblePayload.addAll(payload); } int prevStart = prevSpans.start(); int prevEnd = prevSpans.end(); while (true) { // Advance prevSpans until after (lastStart, lastEnd) if (! prevSpans.next()) { inSameDoc = false; more = false; break; // Check remaining subSpans for final match. } else if (matchDoc != prevSpans.doc()) { inSameDoc = false; // The last subSpans is not advanced here. break; // Check remaining subSpans for last match in this document. } else { int ppStart = prevSpans.start(); int ppEnd = prevSpans.end(); // Cannot avoid invoking .end() if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) { break; // Check remaining subSpans. } else { // prevSpans still before (lastStart, lastEnd) prevStart = ppStart; prevEnd = ppEnd; if (collectPayloads && prevSpans.isPayloadAvailable()) { Collection payload = prevSpans.getPayload(); possiblePayload = new ArrayList<>(payload.size()); possiblePayload.addAll(payload); } } } } if (collectPayloads && possiblePayload != null) { possibleMatchPayloads.addAll(possiblePayload); } assert prevStart <= matchStart; if (matchStart > prevEnd) { // Only non overlapping spans add to slop. matchSlop += (matchStart - prevEnd); } /* Do not break on (matchSlop > allowedSlop) here to make sure * that subSpans[0] is advanced after the match, if any. */ matchStart = prevStart; lastStart = prevStart; lastEnd = prevEnd; } boolean match = matchSlop
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903 ] Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:19 PM: -- After thoroughly looking into SpanQueries my conclusion is, that we have a fundamental problem in the implementation of SpanNearQuery. The problem is not new, it probably existed already in the first version of SpanQueries which as far as I know were implemented by Doug Cutting himself. I remember some attempts to describe in which cases SpanQueries work correctly and in which they do not (discussions about overlapping), but those explanations and definitions were never completely convincing for me. My best guess: NearSpansOrdered and NearSpansUnordered currently are only correct if for each clause of the SpanQuery we can guarantee, that all its matches have the same length. In this case it is clear that (for the ordered case) if a match is too long (sloppy) we can skip to the first clause and call nextPosition. No alternative matches of intermediate clauses could improve the overall match. If we have clauses with varying match length (SpanOr or SpanNear with sloppyness) we would have to backtrack to intermediate clauses and check whether there are e.g. longer matches that could reduce the overall match length. Pauls last test case shows that even a match of the second clause that advances its position can reduce the overall lenght if it is longer himself. A match of an intermediate clause at an advanced position could be considerably shorter than its first match requiring a reset of the spans of following clauses. To my opinion this bug can only be fixed by implementing a backtracking search on the subspans that also requires a limited possibilitxy to reposion Spans to previous positions. By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4 provided a kind of backtracking which was the reason why my queries worked in elasticsearch 1.7.x. However, I think the implementation also did not solve all cases: {code} /** The subSpans are ordered in the same doc, so there is a possible match. * Compute the slop while making the match as short as possible by advancing * all subSpans except the last one in reverse order. */ private boolean shrinkToAfterShortestMatch() throws IOException { matchStart = subSpans[subSpans.length - 1].start(); matchEnd = subSpans[subSpans.length - 1].end(); SetpossibleMatchPayloads = new HashSet<>(); if (subSpans[subSpans.length - 1].isPayloadAvailable()) { possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload()); } Collection possiblePayload = null; int matchSlop = 0; int lastStart = matchStart; int lastEnd = matchEnd; for (int i = subSpans.length - 2; i >= 0; i--) { Spans prevSpans = subSpans[i]; if (collectPayloads && prevSpans.isPayloadAvailable()) { Collection payload = prevSpans.getPayload(); possiblePayload = new ArrayList<>(payload.size()); possiblePayload.addAll(payload); } int prevStart = prevSpans.start(); int prevEnd = prevSpans.end(); while (true) { // Advance prevSpans until after (lastStart, lastEnd) if (! prevSpans.next()) { inSameDoc = false; more = false; break; // Check remaining subSpans for final match. } else if (matchDoc != prevSpans.doc()) { inSameDoc = false; // The last subSpans is not advanced here. break; // Check remaining subSpans for last match in this document. } else { int ppStart = prevSpans.start(); int ppEnd = prevSpans.end(); // Cannot avoid invoking .end() if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) { break; // Check remaining subSpans. } else { // prevSpans still before (lastStart, lastEnd) prevStart = ppStart; prevEnd = ppEnd; if (collectPayloads && prevSpans.isPayloadAvailable()) { Collection payload = prevSpans.getPayload(); possiblePayload = new ArrayList<>(payload.size()); possiblePayload.addAll(payload); } } } } if (collectPayloads && possiblePayload != null) { possibleMatchPayloads.addAll(possiblePayload); } assert prevStart <= matchStart; if (matchStart > prevEnd) { // Only non overlapping spans add to slop. matchSlop += (matchStart - prevEnd); } /* Do not break on (matchSlop > allowedSlop) here to make sure * that subSpans[0] is advanced after the match, if any. */ matchStart = prevStart; lastStart = prevStart; lastEnd = prevEnd; } boolean match = matchSlop
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903 ] Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:18 PM: -- After thoroughly looking into SpanQueries my conclusion is, that we have a fundamental problem in the implementation of SpanNearQuery. The problem is not new, it probably existed already in the first version of SpanQueries which as far as I know were implemented by Doug Cutting himself. I remember some attempts to describe in which cases SpanQueries work correctly and in which they do not (discussions about overlapping), but those explanations and definitions were never completely convincing for me. My best guess: NearSpansOrdered and NearSpansUnordered currently are only correct if for each clause of the SpanQuery we can guarantee, that all its matches have the same length. In this case it is clear that (for the ordered case) if a match is too long (sloppy) we can skip to the first clause and call nextPosition. No alternative matches of intermediate clauses could improve the overall match. If we have clauses with varying match length (SpanOr or SpanNear with sloppyness) we would have to backtrack to intermediate clauses and check whether there are e.g. longer matches that could reduce the overall match length. Pauls last test case shows that even a match of the second clause that advances its position can reduce the overall lenght if it is longer himnself. A match of an intermediate clause at an advanced position could be considerably shorter than its first match requiring a reset of the spans of following clauses. To my opinion this bug can only be fixed by implementing a backtracking search on the subspans that also requires a limited possibilitxy to reposion Spans to previous positions. By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4_10_4 provided a kind of backtracking which was the reason why my queries worked in elasticsearch 1.7.x. However, I think the implementation also did not solve all cases: {code} /** The subSpans are ordered in the same doc, so there is a possible match. * Compute the slop while making the match as short as possible by advancing * all subSpans except the last one in reverse order. */ private boolean shrinkToAfterShortestMatch() throws IOException { matchStart = subSpans[subSpans.length - 1].start(); matchEnd = subSpans[subSpans.length - 1].end(); SetpossibleMatchPayloads = new HashSet<>(); if (subSpans[subSpans.length - 1].isPayloadAvailable()) { possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload()); } Collection possiblePayload = null; int matchSlop = 0; int lastStart = matchStart; int lastEnd = matchEnd; for (int i = subSpans.length - 2; i >= 0; i--) { Spans prevSpans = subSpans[i]; if (collectPayloads && prevSpans.isPayloadAvailable()) { Collection payload = prevSpans.getPayload(); possiblePayload = new ArrayList<>(payload.size()); possiblePayload.addAll(payload); } int prevStart = prevSpans.start(); int prevEnd = prevSpans.end(); while (true) { // Advance prevSpans until after (lastStart, lastEnd) if (! prevSpans.next()) { inSameDoc = false; more = false; break; // Check remaining subSpans for final match. } else if (matchDoc != prevSpans.doc()) { inSameDoc = false; // The last subSpans is not advanced here. break; // Check remaining subSpans for last match in this document. } else { int ppStart = prevSpans.start(); int ppEnd = prevSpans.end(); // Cannot avoid invoking .end() if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) { break; // Check remaining subSpans. } else { // prevSpans still before (lastStart, lastEnd) prevStart = ppStart; prevEnd = ppEnd; if (collectPayloads && prevSpans.isPayloadAvailable()) { Collection payload = prevSpans.getPayload(); possiblePayload = new ArrayList<>(payload.size()); possiblePayload.addAll(payload); } } } } if (collectPayloads && possiblePayload != null) { possibleMatchPayloads.addAll(possiblePayload); } assert prevStart <= matchStart; if (matchStart > prevEnd) { // Only non overlapping spans add to slop. matchSlop += (matchStart - prevEnd); } /* Do not break on (matchSlop > allowedSlop) here to make sure * that subSpans[0] is advanced after the match, if any. */ matchStart = prevStart; lastStart = prevStart; lastEnd = prevEnd; } boolean match = matchSlop
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404699#comment-15404699 ] Paul Elschot edited comment on LUCENE-7398 at 8/2/16 8:17 PM: -- The patch changes the order in SpanPositionQueue which is used by SpanOr. This works to pass the test case, and it does not increase complexity. I think the problem is in NearSpansOrdered.stretchToOrder() which only does this: {code}matchEnd = subSpans[subSpans.length - 1].endPosition();{code} What is should also do is lookahead to see whether there is an ordered match with a smaller slop. It could be that there still is a failing case with a nested SpanOr, possibly containing another nested SpanNear, but I'm not sure, this is tricky. Since looking ahead increases the complexity (normal case runtime) I'd prefer to have the patch applied now, and see what the future brings. was (Author: paul.elsc...@xs4all.nl): The patch changes the order in SpanPositionQueue which is used by SpanOr. This works to pass the test case, and it does not increase complexity. I think the problem is in NearSpansOrdered.stretchToOrder() which only does this: {code}matchEnd = subSpans[subSpans.length - 1].endPosition();{code} What is should also do is lookahead to see whether there is an ordered match with a smaller slop. It could be that there still is a failing case with a nested SpanOr, possibly over containing another nested SpanNear, but I'm not sure, this is tricky. Since looking ahead increases the complexity (normal case runtime) I'd prefer to have the patch applied now, and see what the future brings. > Nested Span Queries are buggy > - > > Key: LUCENE-7398 > URL: https://issues.apache.org/jira/browse/LUCENE-7398 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 5.5, 6.x >Reporter: Christoph Goller >Assignee: Alan Woodward >Priority: Critical > Attachments: LUCENE-7398.patch, TestSpanCollection.java > > > Example for a nested SpanQuery that is not working: > Document: Human Genome Organization , HUGO , is trying to coordinate gene > mapping research worldwide. > Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], > 0, true), body:gene]), body:research], 0, true) > The query should match "coordinate gene mapping research" as well as > "coordinate gene research". It does not match "coordinate gene mapping > research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It > probably stopped working with the changes on SpanQueries in 5.3. I will > attach a unit test that shows the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy
[ https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399414#comment-15399414 ] Christoph Goller edited comment on LUCENE-7398 at 7/29/16 2:40 PM: --- Please find attatched an extended TestSpanCollection.java for Lucene 6.1 that shows the problem. was (Author: gol...@detego-software.de): Please find attatched an extended TestSpanCollection.java that shows the problem. > Nested Span Queries are buggy > - > > Key: LUCENE-7398 > URL: https://issues.apache.org/jira/browse/LUCENE-7398 > Project: Lucene - Core > Issue Type: Bug > Components: core/search >Affects Versions: 5.5, 6.x >Reporter: Christoph Goller >Priority: Critical > Attachments: TestSpanCollection.java > > > Example for a nested SpanQuery that is not working: > Document: Human Genome Organization , HUGO , is trying to coordinate gene > mapping research worldwide. > Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], > 0, true), body:gene]), body:research], 0, true) > The query should match "coordinate gene mapping research" as well as > "coordinate gene research". It does not match "coordinate gene mapping > research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It > probably stopped working with the changes on SpanQueries in 5.3. I will > attach a unit test that shows the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org