[jira] [Comment Edited] (LUCENE-8034) SpanNotWeight returns wrong results due to integer overflow

Hari Menon (JIRA) Sun, 05 Nov 2017 18:39:41 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-8034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239836#comment-16239836
 ]


Hari Menon edited comment on LUCENE-8034 at 11/6/17 2:38 AM:
-------------------------------------------------------------

[~mikemccand] That's a good question, and actually something I could use help 
with. It would be awesome if you could let me know if there are any potential 
bottlenecks with the way I am trying to solve this problem. Let me know if I 
should instead post to the users@ mailing list. Here is the problem I am trying 
to solve:

I have to index documents of type A, each of which contains sub-documents of 
type B. E.g., A1 might contain sub-documents B11, B12, B13, etc., and A2 might 
contain B21, B22, B23, B24, and so on. In my search use case, I sometimes need 
all the search terms to match within a single B-document, and sometimes 
anywhere within a single A-document. In both cases, I also need the IDs of the 
B-documents that matched. I know that my B-documents have a fixed maximum 
number of words (say 500). The way I am solving this right now is:
- Use A as the Lucene document to be indexed, with a field "text" containing 
the text from all of its B sub-documents.
- Index B11 between positions 0 and 499, B12 between 1000 and 1499, B13 
between 2000 and 2499, and so on. I am using PositionIncrementTokenStream to 
set the positions.
- Then use SpanQueries with a slop of 500 to search within a single 
B-document, and a slop of Int.MAX_VALUE to search across the entire 
A-document. SpanQuery also gives me easy access to the match positions, which 
I can divide by 1000 to get the index of the matching B-document. Using a slop 
of Int.MAX_VALUE is where I ran into this issue.
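
The position layout described in the list above can be sketched in plain Java, 
without any Lucene dependency. This is only an illustration of the arithmetic: 
the class and method names are hypothetical, the block size of 1000 and the 
500-token limit are the example values from the description, and the increment 
helper assumes the usual Lucene convention that the position before the first 
token is -1.

```java
final class SubDocPositions {
    // Example values from the description above (assumptions, not Lucene constants).
    static final int MAX_TOKENS_PER_SUBDOC = 500;
    static final int BLOCK_SIZE = 1000;

    /** Absolute position of the token at tokenOffset inside sub-document subDocIndex. */
    static int absolutePosition(int subDocIndex, int tokenOffset) {
        if (subDocIndex < 0 || tokenOffset < 0 || tokenOffset >= MAX_TOKENS_PER_SUBDOC) {
            throw new IllegalArgumentException("index or offset out of range");
        }
        // The *Exact methods throw instead of silently wrapping around.
        return Math.addExact(Math.multiplyExact(subDocIndex, BLOCK_SIZE), tokenOffset);
    }

    /** Recovers the sub-document index from a matched position. */
    static int subDocIndex(int position) {
        return position / BLOCK_SIZE;
    }

    /**
     * Position increment for the first token of a sub-document, so that it
     * lands at the start of its block. lastPosition is the position of the
     * previously emitted token, or -1 if no token has been emitted yet.
     */
    static int firstTokenIncrement(int subDocIndex, int lastPosition) {
        return Math.multiplyExact(subDocIndex, BLOCK_SIZE) - lastPosition;
    }

    public static void main(String[] args) {
        System.out.println(absolutePosition(1, 0));      // first token of the second B-document
        System.out.println(subDocIndex(2499));           // B-document index for position 2499
        System.out.println(firstTokenIncrement(1, 499)); // gap after a full first B-document
    }
}
```

Because each B-document gets its own 1000-position block, dividing a match 
position by the block size gives the B-document index directly.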

Does this make sense? Let me know if you see any gaping holes or performance 
issues with this approach. I am still new to Lucene and haven't done a full 
performance benchmark yet, as I am still building a prototype.

[~rcmuir] Will it affect scores? I think it will just not select the given 
record, right?


was (Author: hshankar):
[~mikemccand] That's a good question, and actually something I could use help 
with. It would be awesome if you could let me know if there are any potential 
bottlenecks with the way I am trying to solve this problem. Let me know if I 
should instead post to the users@ mailing list. Here is the problem I am trying 
to solve:

I have to index documents of type A, which internally have sub-documents of 
type B. e.g A1 might contain sub-documents B11, B12, B13 etc. A2 can contain 
B21, B22, B23, B24 and so on. My search use case is such that I might want to 
have matches where all the search terms are within a particular B-document, or 
it could be within a particular A-document. Besides, I need the B-document Ids 
that matched in both cases. I know that my B-documents have a fixed max. number 
of words (say 500). The way I am solving this right now is:
- Use A as the lucene document to be indexed, with a field "text" containing 
text from the B sub-documents.
- The idea is to index B11 between position 0 and 499, B12 from 1000 to 1499, 
B13 from 2000 to 2499 and so on. I am using PositionIncrementTokenStream to fix 
the positions.
- Then use SpanQueries with max span of 500 if we want to search within 
B-documents, and max span of Int.MAX_VALUE if we want to search in the entire 
A-document. Using SpanQuery also gives me easy access to position, which I can 
then divide by 500 to get the index of the actual B-document. This is where I 
was trying to use max span of Int.MAX_VALUE and ran into this issue.

Does this make sense? Let me know if you see any gaping holes or perf issues 
with this approach. I am still new to lucene and haven't done a full perf 
benchmark with this approach as I am still building a prototype.

[~rcmuir] Will it affect scores? I think it will just not select the given 
record, right?

> SpanNotWeight returns wrong results due to integer overflow
> -----------------------------------------------------------
>
>                 Key: LUCENE-8034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8034
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring, core/search
>            Reporter: Hari Menon
>            Priority: Minor
>              Labels: newbie, patch
>         Attachments: LUCENE-8034.patch
>
>
> In SpanNotQuery, there is an acceptance condition:
> {code:java}
> if (candidate.endPosition() + post <= excludeSpans.startPosition()) {
>     return AcceptStatus.YES;
> }
> {code}
> This overflows when `candidate.endPosition() + post > Integer.MAX_VALUE`. 
> I am working on a fix for this, which basically changes the addition into a 
> subtraction.
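
The overflow described in the issue can be reproduced with plain int 
arithmetic. The sketch below is not the attached patch and does not use any 
Lucene classes; the method names are hypothetical, and the three int 
parameters stand in for `candidate.endPosition()`, `post`, and 
`excludeSpans.startPosition()`. It assumes `post` is non-negative and the 
exclude start is at least -1, so the subtraction itself cannot overflow.

```java
final class SpanNotOverflowDemo {
    /** Same shape as the quoted condition: the sum can wrap around to a negative int. */
    static boolean acceptsWithAdd(int candidateEnd, int post, int excludeStart) {
        return candidateEnd + post <= excludeStart;
    }

    /** Same comparison with post moved to the other side as a subtraction. */
    static boolean acceptsWithSubtract(int candidateEnd, int post, int excludeStart) {
        return candidateEnd <= excludeStart - post;
    }

    public static void main(String[] args) {
        int candidateEnd = 10;
        int post = Integer.MAX_VALUE;
        int excludeStart = 1000;
        // Mathematically 10 + 2147483647 <= 1000 is false, but the int sum wraps.
        System.out.println(acceptsWithAdd(candidateEnd, post, excludeStart));
        System.out.println(acceptsWithSubtract(candidateEnd, post, excludeStart));
    }
}
```

For small values the two forms agree; they differ only once the sum would 
exceed Integer.MAX_VALUE.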



