[
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825116#comment-16825116
]
Ram Venkat commented on LUCENE-8776:
------------------------------------
Adrien - That will not work, as searching for "organic adjacent to light"
would highlight the entire word "light-emitting-diode" instead of just "light".
Conversely, if light-emitting-diode is given the same offsets as light or
diode, then a search for light-emitting-diode highlights only light or diode.
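To make the two cases above concrete, here is a minimal sketch in plain Python (not Lucene code). It assumes a highlighter that simply wraps the character range given by the matched token's offsets; the `highlight` helper and the `<b>` markers are hypothetical and only illustrate the argument.

```python
TEXT = "Organic light-emitting-diode glows"


def highlight(text, start, end):
    """Wrap text[start:end] in <b> markers, the way an offset-based highlighter would."""
    return text[:start] + "<b>" + text[start:end] + "</b>" + text[end:]


# Character offsets into TEXT.
WHOLE = (8, 28)  # "light-emitting-diode"
LIGHT = (8, 13)  # "light"

# Case 1: the part 'light' carries the whole word's offsets, so a match on
# 'light' (e.g. "organic adjacent to light") highlights the entire compound.
case1 = highlight(TEXT, *WHOLE)

# Case 2: the compound carries the offsets of 'light', so a match on
# 'light-emitting-diode' highlights only 'light'.
case2 = highlight(TEXT, *LIGHT)

print(case1)
print(case2)
```

Either way, one of the two searches ends up highlighting the wrong span, which is why each token needs its own offsets.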
Robert,
We are not writing any new "bad" algorithm. We have been using this feature for
a while. Allowing offsets to go backwards has been a feature of Lucene for
a long time. This check and exception broke that feature.
And, no, I am not asking anyone to buy more hardware. It's just a figure of
speech to say that net performance depends on many factors, and a certain
part of the code being O(n^2) may or may not affect it. In our case, it does
not. That is the only point I want to make.
Removing a long-standing feature of Lucene because (a) it affects a newer
feature (postings) which is used by some people or (b) it might cause a
noticeable performance degradation in some cases is not well justified. We are
dependent on this feature. We have no alternatives at this point. And I have
evidence, from extensive testing in our environment and on our data, that it
does not affect performance in a noticeable way. Plus, I am guessing that we
are not the only ones in the world using this feature.
For these reasons, we should either move this check and exception to other
parts of Lucene (without affecting indexing and standard highlighter) or remove
it.
> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
> Key: LUCENE-8776
> URL: https://issues.apache.org/jira/browse/LUCENE-8776
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 7.6
> Reporter: Ram Venkat
> Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run
> span queries and highlight them properly.
> During index time, light-emitting-diode is split into three words, which
> allows me to search for 'light', 'emitting' and 'diode' individually. The
> three words occupy adjacent positions in the index, because queries such as
> 'light' adjacent to 'emitting', or 'light' within two words of 'diode', need
> to match this word. So, the order of words after splitting is: Organic,
> light, emitting, diode, glows.
> But, I also want to search for 'organic' being adjacent to
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'.
> The way I solved this was to also generate 'light-emitting-diode' at two
> positions: (a) in the same position as 'light' and (b) in the same position
> as 'diode', like below:
> | Token    | organic | light                | emitting | diode                | glows |
> | Token    |         | light-emitting-diode |          | light-emitting-diode |       |
> | Position | 0       | 1                    | 2        | 3                    | 4     |
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets
> are obviously the same. This works beautifully in Lucene 5.x in both
> searching and highlighting with span queries.
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is
> being thrown without any comments on why this check is needed. As I explained
> above, startOffset going backwards is perfectly valid, to deal with word
> splitting and span operations on these specialized use cases. On the other
> hand, it is not clear what value is added by this check and which highlighter
> code is affected by offsets going backwards. This same check is done at
> BaseTokenStreamTestCase:245.
> I see others talk about how this check found bugs in WordDelimiter etc. but
> it also prevents legitimate use cases. Can this check be removed?
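
The token layout described in the issue can be sketched in plain Python as follows. This is an illustrative model, not Lucene code: the `Token` class and `first_backwards_offset` function are hypothetical, and the rule used here (a start offset smaller than the previous token's start offset counts as "going backwards") is an assumption based on the exception message quoted above.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    position: int
    start_offset: int
    end_offset: int


TEXT = "Organic light-emitting-diode glows"

# Tokens in emission order; the compound appears at positions 1 and 3.
TOKENS = [
    Token("organic", 0, 0, 7),
    Token("light", 1, 8, 13),
    Token("light-emitting-diode", 1, 8, 28),
    Token("emitting", 2, 14, 22),
    Token("diode", 3, 23, 28),
    Token("light-emitting-diode", 3, 8, 28),  # start offset goes back from 23 to 8
    Token("glows", 4, 29, 34),
]


def first_backwards_offset(tokens):
    """Return the first token whose start offset is smaller than the
    previous token's start offset, or None if offsets never go backwards."""
    last_start = 0
    for tok in tokens:
        if tok.start_offset < last_start:
            return tok
        last_start = tok.start_offset
    return None


offender = first_backwards_offset(TOKENS)
print(offender)
```

In this model, the only token that trips the check is the second copy of 'light-emitting-diode' at position 3; dropping it leaves a stream whose start offsets never decrease.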