[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

Simon Willnauer (Jira) Wed, 12 Aug 2020 02:01:45 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176184#comment-17176184
 ]


Simon Willnauer commented on LUCENE-8776:
-----------------------------------------

I think preventing negative offsets is about more than whether to let people 
pay the price of 5 bytes per negative offset. Many more things than 
compression rely on this behavior. Just a few that come to my mind right away:

 * Correctness verification of indices: if we see a negative offset, the index 
must be broken. 
 * Catching broken analyzers early.
 * Having clear bounds on what an offset can be and in which direction it can 
grow. This helps tons if you want to implement future features that can rely on 
this behavior. 
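
To make the "5 bytes per negative offset" cost concrete, here is a minimal 
sketch. It assumes start offsets are stored as deltas from the previous start 
offset in a variable-length integer format like Lucene's VInt (7 payload bits 
per byte). The class and method names are hypothetical and only illustrate the 
arithmetic; this is not Lucene code.

```java
class VIntCostSketch {
    // Number of bytes a VInt-style encoding needs for a 32-bit value:
    // 7 payload bits per byte, with the high bit used as a continuation flag.
    static int vIntLength(int value) {
        int length = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7; // unsigned shift, so negative values keep all 32 bits
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        int lastStartOffset = 23;
        // Offsets moving forward: small positive delta.
        System.out.println(vIntLength(29 - lastStartOffset)); // prints 1
        // Offsets moving backwards: negative delta has its sign bit set.
        System.out.println(vIntLength(8 - lastStartOffset));  // prints 5
    }
}
```

A negative delta always has its top bit set, so it needs all five bytes no 
matter how small the step backwards is.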
 
{quote}
So if we just add a simple mechanism to instruct Lucene which DocConsumer to 
use, then all could be happy and not have to resort to dirty hacks or forks. 
The most efficient impl will be the default, yet will allow us - dirty 
bastards - to shoot ourselves in the foot if we so desire. SOLR as well as 
ElasticSearch devs might not mind having the option in the future - it can come 
in handy. Wouldn't that be wonderful? Well, wonderful certainly not, just 
useful... could I do it?
{quote}

If you feel like you want to implement this, go ahead. I am sure there will be 
more issues, e.g. CheckIndex will not work anymore. There might be future 
features you will break by doing this. But again, it's all open source; nobody 
forces you to upgrade or to use the code we package directly. You can download 
the source and modify it; that is all up to you.

We as a community need to make decisions that sometimes don't work for 
everybody. We have a great responsibility with a project like Lucene being used 
in an unbelievably wide range of applications, and that sometimes means adding 
restrictions. We don't take this lightly; it's almost always a hard decision. 
Having offsets go forward only will be a win for a large majority, and that's 
why we keep this restriction. I am truly sorry for your struggle.

Talking about extending DocConsumer, I am torn on whether it should be an 
extension point. I know that there is an implementation out there that uses it, 
but if I could pick, I'd remove this extension point since it's, as you say, 
way too hard to get right. 


> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting is: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  
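
The token layout from the description can be sketched in plain Java to show 
where such a check would fire. This is an illustration only: the Token class 
and method names are hypothetical stand-ins, not Lucene classes. The offsets 
are computed from the example sentence, the positions 1 and 3 follow the table 
above, and the rule is assumed to be "a token's start offset must not be 
smaller than the previous token's start offset".

```java
import java.util.Arrays;
import java.util.List;

class OffsetOrderSketch {
    // Hypothetical stand-in for a token's term, position and character offsets.
    static final class Token {
        final String term;
        final int position;
        final int startOffset;
        final int endOffset;

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Tokens for "Organic light-emitting-diode glows" in emission order.
    static List<Token> exampleStream() {
        return Arrays.asList(
            new Token("organic", 0, 0, 7),
            new Token("light", 1, 8, 13),
            new Token("light-emitting-diode", 1, 8, 28),
            new Token("emitting", 2, 14, 22),
            new Token("diode", 3, 23, 28),
            new Token("light-emitting-diode", 3, 8, 28), // start offset drops from 23 to 8
            new Token("glows", 4, 29, 34));
    }

    // Index of the first token whose offsets violate the assumed rule, or -1.
    static int firstBackwardsStart(List<Token> tokens) {
        int lastStart = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.startOffset < lastStart || t.endOffset < t.startOffset) {
                return i;
            }
            lastStart = t.startOffset;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstBackwardsStart(exampleStream())); // prints 5
    }
}
```

Under these assumptions the first copy of 'light-emitting-diode' passes, 
because it shares its start offset with 'light'. The second copy is the one 
that trips the check, because it follows 'diode'.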



