[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

Roman (Jira) Wed, 12 Aug 2020 11:14:10 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176524#comment-17176524
 ]


Roman commented on LUCENE-8776:
-------------------------------

[~simonw] no doubt that the decisions are not always easy, I appreciate the 
attention you are giving the matter. The arguments presented here are meant to 
provide more information and if you decide that the matter at hand has no 
merit, there is no point arguing (and no bad blood). It is actually easier for 
us to fork Lucene – but it *seems* (and the stress is 'seems') wrong for Lucene 
to intentionally limit itself in what it was doing so well, so I'm trying to 
nudge the case.

 

Consider this: I disabled the checks in DefaultIndexingChain and reran the 
full test suite. These are the tests that started failing:

 

7.7: org.apache.lucene.index.TestPostingsOffsets

8.6: org.apache.lucene.index.TestPostingsOffsets

 

You'll notice that the *only* new tests failing are those that enforce the 
check. Maybe there are no tests written for index integrity?

 

As to the issue at hand: whether the implementation/extension can be made 
'pluggable'. My view is the following: if you constrain what is *already* in 
Lucene, you are forcing people to make forks. We are somewhere in between - it 
would be easy to provide an option to plug in a custom chain. It would cost 
little to give people that option (with BIG NEON WARNINGS pasted all over if 
necessary).

 

Ok, that's an argument from practicality, and not a strong one. But how about 
the "nothing is broken" part? (Yes, the tests that enforce the condition are 
failing, but nothing else is broken.) Use cases are broken: there are 
already two examples of projects that got this complex scenario right (our 
project is one such example). I asked in the forum, and [~dsmiley] and 
[~gh_at] struggled with the same issue in their work. I don't mean to drag 
them in to weigh in (though I obviously wouldn't mind ;)), I'm just 
trying to illustrate that the limitation breaks real-world scenarios.

And the benefits still seem to be in the realm of "future possibilities". 
Sure, those are not to be dismissed lightly; they are important concerns. But 
if we as engineers always chose the most efficient over the most optimal, we 
would be building "houses" without windows (these things lose energy, make 
people fall from heights, require cleaning - they are incredibly wasteful!).

The next thing I could do is run a performance test comparing a tokenizer chain 
which allows backward postings with one which employs the flatten tokenizer, 
and report the results. But I'm going to do that only if the case is really 
open for consideration; otherwise it would be a waste of time.

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting is: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  
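
To make the scenario in the description concrete, here is a minimal model of 
the token stream and of the kind of check being discussed. It is a sketch 
under stated assumptions, not Lucene code: Token, positions and 
find_backward_offsets are hypothetical names, the character offsets are 
computed from the example sentence, and the check is a simplification that 
only compares each start offset with the previous token's start offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for one emitted token (not a Lucene class)."""
    term: str
    pos_inc: int       # position increment relative to the previous token
    start_offset: int  # character offset where the token starts
    end_offset: int    # character offset just past the token's end


TEXT = "Organic light-emitting-diode glows"

# Token stream as laid out in the issue's table: the compound term is emitted
# twice, stacked on 'light' (position 1) and on 'diode' (position 3).
TOKENS = [
    Token("organic", 1, 0, 7),
    Token("light", 1, 8, 13),
    Token("light-emitting-diode", 0, 8, 28),
    Token("emitting", 1, 14, 22),
    Token("diode", 1, 23, 28),
    Token("light-emitting-diode", 0, 8, 28),
    Token("glows", 1, 29, 34),
]


def positions(tokens):
    """Resolve position increments into absolute (position, term) pairs."""
    pos = -1
    resolved = []
    for tok in tokens:
        pos += tok.pos_inc
        resolved.append((pos, tok.term))
    return resolved


def find_backward_offsets(tokens):
    """Collect (index, token) pairs whose start offset is lower than the
    previous token's start offset. Simplified model: it records violations
    instead of raising on the first one."""
    violations = []
    last_start = 0
    for i, tok in enumerate(tokens):
        if tok.start_offset < last_start:
            violations.append((i, tok))
        last_start = tok.start_offset
    return violations


if __name__ == "__main__":
    for pos, term in positions(TOKENS):
        print(pos, term)
    for i, tok in find_backward_offsets(TOKENS):
        print("start offset goes backwards at token", i, tok.term)
```

In this model only the second 'light-emitting-diode' token is flagged: it is 
emitted after 'diode', so its start offset is lower than the preceding 
token's, while the first copy shares its start offset with 'light' and 
passes.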



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

