[ 
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179066#comment-17179066
 ] 

Michael Gibney commented on LUCENE-8776:
----------------------------------------

{quote}I think there is an issue where someone made a prototype that indexed 
position length into payloads and then did the right thing at search time. 
There was also a talk about it at Buzzwords last summer, I think.
{quote}
[~mikemccand] I think you're talking about the work I put in on LUCENE-7398 
(and the [related talk|https://2019.berlinbuzzwords.de/19/session/complete-precise-graph-based-phrase-query-spannearquery.html]
at Buzzwords). Quick update there: that work has been ported forward through 
8.3 (due to be ported forward again soon I hope), and it's been solidly running 
in production for several years. That branch has a bunch of test cases added to 
[TestSpanCollection.java|https://github.com/magibney/lucene-solr/blob/LUCENE-7398/master/lucene/core/src/test/org/apache/lucene/search/spans/TestSpanCollection.java]
 that are relevant to the discussion here (particularly wrt the problem that 
the double-emitted "light-emitting-diode" token workaround seeks to address).

I'm curious about the multi-tokenizer use case; off the top of my head I can 
see that it would save some terms dictionary space (for tokenizers that 
generate a lot of tokens in common), and could help reduce overall field count, 
if that's an issue. Single-language positional queries (e.g. spans, intervals) 
should work, but mixed-language positional queries would not (not a problem, 
but could be useful in some circumstances?). Interleaved tokens (as opposed to 
concatenated streams) at query-time would probably work, and would probably 
"work" at index-time in the sense that they could be indexed within the 
constraints in DefaultIndexingChain – but index-time interleaved tokens would, I 
think, hit unexpected behavior pretty quickly for positional queries (spans, 
intervals) unless the graph structure (i.e., at least positionLength) were 
indexed ...

{quote}position length cannot substitute offsets; even though it is proposed to 
solve the same thing
{quote}
Indeed, positionLength is not a general-purpose substitute for offsets; I never 
intended to suggest otherwise. But LUCENE-4312 and the LUCENE-7398 work are 
relevant to this issue because they present a more robust alternative to Ram's 
double-token-emission workaround for positional queries.
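
To make the contrast concrete, here is a minimal Python sketch of why an indexed positionLength removes the need to emit the compound token twice. It is not the LUCENE-7398 implementation and uses no Lucene API: the {{Tok}} type and the {{adjacent}} helper are hypothetical names for illustration. The only assumption carried over from Lucene is that a token starting at position p with positionLength n ends at position p + n.

```python
from typing import List, NamedTuple


class Tok(NamedTuple):
    """Hypothetical token record: term, start position, and positionLength."""
    term: str
    position: int
    position_length: int = 1


def adjacent(tokens: List[Tok], first: str, second: str) -> bool:
    """True if some `first` token ends exactly where some `second` token starts."""
    ends = {t.position + t.position_length for t in tokens if t.term == first}
    starts = {t.position for t in tokens if t.term == second}
    return bool(ends & starts)


# "Organic light-emitting-diode glows": the compound is emitted ONCE at
# position 1 and spans three positions (positionLength = 3).
graph_tokens = [
    Tok("organic", 0),
    Tok("light", 1),
    Tok("light-emitting-diode", 1, 3),
    Tok("emitting", 2),
    Tok("diode", 3),
    Tok("glows", 4),
]

# Same stream, but with positionLength discarded (every token spans 1 position).
flat_tokens = [Tok(t.term, t.position) for t in graph_tokens]

print(adjacent(graph_tokens, "organic", "light-emitting-diode"))  # True
print(adjacent(graph_tokens, "light-emitting-diode", "glows"))    # True
print(adjacent(flat_tokens, "light-emitting-diode", "glows"))     # False
```

With positionLength available at search time, a single compound token is adjacent to both of its neighbors, so no second copy with a backwards start offset is required. Without it, the compound appears to end at position 2, which is the gap the double-emission workaround tries to paper over.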

Roman, I gather that your case is different, but unfortunately I still can't 
really pinpoint the problem. I see the checks in the DefaultIndexingChain code, 
I see that the previous invertState _startPosition_ is part of validation, I 
acknowledge the need for both index-time and query-time expansion (this was the 
main motivation for my work on LUCENE-7398!)... but the startOffsets in the 
examples you sent _don't_ decrease, and the endOffset of each token is greater 
than its startOffset, so I don't see anything that would violate the 
constraints in DefaultIndexingChain. I was hoping that seeing your actual 
analysis chain would help reproduce your issue locally, but it's not as 
straightforward as I'd hoped, given the number of custom analysis components (I 
also note, without thinking through possible implications, the lack of 
FlattenGraphFilter, and the fact that your word-delimiter and synonym filters 
seem (?) to be monkey-patched variants of "pre-graph" {{WordDelimiterFilter}} 
and {{SynonymFilter}}).
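
For reference, the offset constraints described above can be modeled in a few lines of Python. This is a simplified sketch, not the actual DefaultIndexingChain code: the function name and the tuple layout are hypothetical, and the character offsets are computed by hand from the example sentence in the issue description ("Organic light-emitting-diode glows"). It assumes the two rules are that a token's startOffset must not be less than the previous token's startOffset, and that its endOffset must not be less than its own startOffset.

```python
from typing import List, Optional, Tuple

# (term, startOffset, endOffset), listed in the order the tokens are emitted
Token = Tuple[str, int, int]


def first_offset_violation(tokens: List[Token]) -> Optional[int]:
    """Return the index of the first token that breaks the offset rules, or None."""
    last_start = 0
    for i, (_term, start, end) in enumerate(tokens):
        # Rule 1: start offsets must not go backwards.
        # Rule 2: a token must not end before it starts.
        if start < last_start or end < start:
            return i
        last_start = start
    return None


# Double-emission workaround: the compound is emitted again at the position
# of "diode", so its start offset (8) follows a start offset of 23.
double_emitted = [
    ("organic", 0, 7),
    ("light", 8, 13),
    ("light-emitting-diode", 8, 28),
    ("emitting", 14, 22),
    ("diode", 23, 28),
    ("light-emitting-diode", 8, 28),
    ("glows", 29, 34),
]

# Compound emitted once: start offsets never decrease.
single_emitted = [t for i, t in enumerate(double_emitted) if i != 5]

print(first_offset_violation(double_emitted))  # 5
print(first_offset_violation(single_emitted))  # None
```

A stream whose start offsets never decrease and whose tokens all end at or after their start passes this model, which is why the examples discussed above do not obviously explain the exception.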

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run 
> span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which 
> allows me to search for 'light', 'emitting' and 'diode' individually. The 
> three words occupy adjacent positions in the index, as 'light' adjacent to 
> 'emitting' and 'light' at a distance of two words from 'diode' need to match 
> this word. So, the order of words after splitting are: Organic, light, 
> emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two 
> positions: (a) In the same position as 'light' and (b) in the same position 
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets 
> are obviously the same. This works beautifully in Lucene 5.x in both 
> searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go 
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is 
> being thrown without any comments on why this check is needed. As I explained 
> above, startOffset going backwards is perfectly valid, to deal with word 
> splitting and span operations on these specialized use cases. On the other 
> hand, it is not clear what value is added by this check and which highlighter 
> code is affected by offsets going backwards. This same check is done at 
> BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but 
> it also prevents legitimate use cases. Can this check be removed?  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
