[
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176524#comment-17176524
]
Roman commented on LUCENE-8776:
-------------------------------
[~simonw] no doubt these decisions are not always easy, and I appreciate the
attention you are giving the matter. The arguments presented here are meant to
provide more information; if you decide that the matter at hand has no
merit, there is no point arguing (and no bad blood). It is actually easier for
us to fork Lucene, but it *seems* (and the stress is on 'seems') wrong for
Lucene to intentionally limit itself in what it was doing so well, so I'm
trying to make the case.
Consider this: I disabled the checks in DefaultIndexingChain and reran the
full test suite. These tests started failing:
7.7: org.apache.lucene.index.TestPostingsOffsets
8.6: org.apache.lucene.index.TestPostingsOffsets
You'll notice that the *only* new tests failing are those that enforce the
check. Maybe there are no tests written for index integrity?
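To make the check concrete, here is a minimal standalone sketch of the rule
being discussed, applied to the token layout from the issue description quoted
below. It is a model only, not Lucene's code: the names OffsetCheckSketch,
Token, firstBackwardOffset and issueExample are hypothetical, and the offsets
are computed by hand from the string "Organic light-emitting-diode glows".

```java
import java.util.Arrays;
import java.util.List;

public class OffsetCheckSketch {

    /** Hypothetical stand-in for one emitted token; this is not a Lucene class. */
    public static final class Token {
        public final String term;
        public final int positionIncrement;
        public final int startOffset;
        public final int endOffset;

        public Token(String term, int positionIncrement, int startOffset, int endOffset) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Returns the index of the first token whose start offset is lower than the
     * previous token's start offset (or whose end offset precedes its start),
     * or -1 if no token violates the rule.
     */
    public static int firstBackwardOffset(List<Token> tokens) {
        int lastStartOffset = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.startOffset < lastStartOffset || t.endOffset < t.startOffset) {
                return i;
            }
            lastStartOffset = t.startOffset;
        }
        return -1;
    }

    /** Token layout from the issue: the compound is emitted at positions 1 and 3. */
    public static List<Token> issueExample() {
        return Arrays.asList(
            new Token("organic", 1, 0, 7),
            new Token("light", 1, 8, 13),
            new Token("light-emitting-diode", 0, 8, 28),  // same position as 'light'
            new Token("emitting", 1, 14, 22),
            new Token("diode", 1, 23, 28),
            new Token("light-emitting-diode", 0, 8, 28),  // same position as 'diode'
            new Token("glows", 1, 29, 34));
    }

    public static void main(String[] args) {
        List<Token> tokens = issueExample();
        int i = firstBackwardOffset(tokens);
        System.out.println(i < 0
            ? "offsets never go backwards"
            : "offsets go backwards at token " + i + ": " + tokens.get(i).term);
    }
}
```

In this model the second 'light-emitting-diode' (index 5) is the token that
trips the rule: its start offset is 8, while the preceding 'diode' starts at 23.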
As to the issue at hand: whether the implementation/extension can be made
'pluggable'. My view is the following: if you constrain what is *already* in
Lucene, you are forcing people to make forks. We are somewhere in between - it
would be easy to provide an option to plug in a custom chain. It would cost
little to give them the option (with BIG NEON WARNINGS pasted all over if
necessary).
Ok, that's an argument from practicality, not a strong one. But how about the
"nothing is broken" part? (Yes, the tests that enforce the condition are
failing, but nothing else is broken.) Use cases are broken: there are
already two examples of projects that got this complex scenario right (our
project is one such example). I asked in the forum, and [~dsmiley] and
[~gh_at] struggled in their work with the same issue. I don't mean to drag
them in to weigh in (but I wouldn't mind, obviously ;)); I'm just
trying to illustrate that the limitations break real-world scenarios.
And the benefits still seem to be in the realm of "future possibilities".
Sure, those are not to be dismissed lightly; they are important concerns. But
if we as engineers always chose the most efficient over the most optimal, we
would always be building "houses" without windows (these things lose energy,
make people fall from heights, require cleaning - are incredibly wasteful!)
The next thing I could test is to run a performance test with a tokenizer chain
which allows backward postings and one which employs the flatten tokenizer,
and report the results. But I'm going to do that only if the case is really
open for consideration; otherwise it would be a waste of time.
> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
> Key: LUCENE-8776
> URL: https://issues.apache.org/jira/browse/LUCENE-8776
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 7.6
> Reporter: Ram Venkat
> Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run
> span queries and highlight them properly.
> During index time, light-emitting-diode is split into three words, which
> allows me to search for 'light', 'emitting' and 'diode' individually. The
> three words occupy adjacent positions in the index, as 'light' adjacent to
> 'emitting' and 'light' at a distance of two words from 'diode' need to match
> this word. So, the order of words after splitting is: Organic, light,
> emitting, diode, glows.
> But, I also want to search for 'organic' being adjacent to
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'.
> The way I solved this was to also generate 'light-emitting-diode' at two
> positions: (a) In the same position as 'light' and (b) in the same position
> as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets
> are obviously the same. This works beautifully in Lucene 5.x in both
> searching and highlighting with span queries.
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is
> being thrown without any comments on why this check is needed. As I explained
> above, startOffset going backwards is perfectly valid, to deal with word
> splitting and span operations on these specialized use cases. On the other
> hand, it is not clear what value is added by this check and which highlighter
> code is affected by offsets going backwards. This same check is done at
> BaseTokenStreamTestCase:245.
> I see others talk about how this check found bugs in WordDelimiter etc. but
> it also prevents legitimate use cases. Can this check be removed?