[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

Roman (Jira) Mon, 10 Aug 2020 15:25:09 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175081#comment-17175081
 ]


Roman commented on LUCENE-8776:
-------------------------------

I suffer from the same issue: we have multi-token synonyms that can even 
overlap. I recognize the arguments against backward offsets, but I find them 
surprisingly backwards: they say that the implementation dictates the 
function, when for many people the function is the goal. The arguments also 
seem to say that because the most efficient implementation (non-negative 
integer deltas) does not allow backward offsets, backward offsets are a bug. 

Please recognize that the most elegant implementation sometimes means "as 
complex as needed", which is not the same as "the simplest". If negative vints 
consume 5 bytes instead of 4, some people need to pay that price and are 
willing to. Their use cases cannot simply be 'boxed' into a world where one 
only looks ahead and never back (NLP is one such use case).
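
To make the byte cost concrete, here is a minimal sketch, assuming the usual 
7-bits-per-byte variable-length scheme (which, as far as I know, is what 
Lucene's writeVInt follows). The class and method names are made up for 
illustration; nothing here is Lucene code, and it only counts bytes.

```java
final class VIntSize {

    // Counts how many bytes a 32-bit value would occupy when written
    // 7 bits at a time, with the high bit of each byte used as a
    // "more bytes follow" flag. Hypothetical helper, not a Lucene API.
    static int vIntBytes(int value) {
        int bytes = 1;
        // Unsigned shift: a negative value keeps high bits set until
        // all 32 bits have been consumed, so it always needs 5 bytes.
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println("delta  5 -> " + vIntBytes(5) + " byte(s)");
        System.out.println("delta -5 -> " + vIntBytes(-5) + " byte(s)");
    }
}
```

Under that assumption, a small forward delta such as 5 fits in one byte, 
while any negative delta takes the full five, so the extra cost would fall 
only on the tokens that actually step backwards.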

Lucene, however, invites one particular solution:

The vint implementation does not seem to mind a negative offset 
(https://issues.apache.org/jira/browse/LUCENE-3738), and DefaultIndexingChain 
extends DocConsumer; the name 'Default' suggests that at some point in the 
past, Lucene developers wanted to provide other implementations. As it is 
*right now*, it is not easy to plug in a different 'DocConsumer', which surely 
seems like an important omission (one size fits all?). 

So if we just add a simple mechanism to instruct Lucene which DocConsumer to 
use, then all could be happy and not have to resort to dirty hacks or forks. 
The most efficient impl will remain the default, yet it will allow us - dirty 
bastards - to shoot ourselves in the foot if we so desire. Solr as well as 
Elasticsearch devs might not mind having the option in the future; it can come 
in handy. Wouldn't that be wonderful? Well, certainly not wonderful, just 
useful... Could I do it? [~rcmuir] [~mikemccand] [~simonw]


> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting is: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  
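
The token layout from the issue description can be sketched as follows. This 
is a self-contained illustration, not Lucene code: the Token class, the 
emission order within a position, and the character offsets (computed by hand 
from the example sentence) are assumptions, and firstBackwardIndex only 
mimics the kind of "offsets must not go backwards" check the description 
refers to.

```java
import java.util.Arrays;
import java.util.List;

final class OffsetOrderSketch {

    /** Hypothetical stand-in for an analyzed token; not a Lucene class. */
    static final class Token {
        final String term;
        final int position;
        final int startOffset;
        final int endOffset;

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Tokens for "Organic light-emitting-diode glows", in an assumed
    // emission order, with the compound injected at positions 1 and 3.
    static List<Token> issueExample() {
        return Arrays.asList(
            new Token("organic", 0, 0, 7),
            new Token("light", 1, 8, 13),
            new Token("light-emitting-diode", 1, 8, 28),
            new Token("emitting", 2, 14, 22),
            new Token("diode", 3, 23, 28),
            new Token("light-emitting-diode", 3, 8, 28),
            new Token("glows", 4, 29, 34));
    }

    // Returns the index of the first token whose start offset is smaller
    // than the previous token's start offset, or -1 if there is none.
    static int firstBackwardIndex(List<Token> tokens) {
        int lastStartOffset = 0;
        for (int i = 0; i < tokens.size(); i++) {
            int start = tokens.get(i).startOffset;
            if (start < lastStartOffset) {
                return i;
            }
            lastStartOffset = start;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Token> tokens = issueExample();
        int i = firstBackwardIndex(tokens);
        System.out.println(i < 0
            ? "offsets never go backwards"
            : "offset goes backwards at token " + i + ": " + tokens.get(i).term);
    }
}
```

With this ordering, the second 'light-emitting-diode' (start offset 8) 
follows 'diode' (start offset 23), which is where a monotonic start-offset 
check would reject the stream.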



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

