[
https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177977#comment-17177977
]
Michael Gibney commented on LUCENE-8776:
----------------------------------------
{quote}I see this solution as working around the fact that {{positionLength}}
is not indexed in Lucene
{quote}
I think that indeed describes the motivation behind [~venkat11]'s example. For
some such cases, restricting graph tokenStreams to query-time would help; but I
don't think there's any escaping the fact that there are cases where graph
tokenStreams would more appropriately be generated at index time (e.g., using
extra context/information that can _only_ be available at index time). There's
always LUCENE-4312 :-) ...
I'm also still struggling to understand [[email protected]]'s examples,
which re-focused attention on this issue; like [~mikemccand], I don't see why
any of these examples would fail based on the offset constraints in
{{DefaultIndexingChain}}. AFAICT, there is no "{{offsetEnd <
lastToken.offsetEnd}}" restriction; each startOffset in the examples above is
{{>=}} the preceding startOffset, and each endOffset is {{>=}} its
corresponding startOffset. I feel like I'm missing something.
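For concreteness, the constraints as described here can be sketched as below. This is an illustrative Python sketch of the logic only, not the Java code in {{DefaultIndexingChain}}; the function name is hypothetical, and tokens are reduced to (startOffset, endOffset) pairs.

```python
from typing import Iterable, Optional, Tuple


def first_offset_violation(offsets: Iterable[Tuple[int, int]]) -> Optional[int]:
    """Return the index of the first token that breaks the constraints, else None.

    Constraints, as described in the comment above:
      * startOffset is non-negative
      * endOffset >= startOffset for the same token
      * startOffset >= the preceding token's startOffset
    Note that endOffset is NOT compared against the preceding endOffset.
    """
    last_start = 0
    for i, (start, end) in enumerate(offsets):
        if start < 0 or end < start or start < last_start:
            return i
        last_start = start
    return None


# endOffset shrinks from 10 to 4, but startOffset never goes backwards: accepted.
print(first_offset_violation([(0, 10), (0, 4), (5, 10)]))  # None
# The third token's startOffset (2) is behind the previous one (5): rejected.
print(first_offset_violation([(0, 4), (5, 10), (2, 12)]))  # 2
```

Under this reading, only a startOffset that moves backwards (or an endOffset behind its own startOffset) would be rejected.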
[[email protected]], can you share your actual analysis chain config?
No matter what startOffset/endOffsets your tokens have, even if they get
_really_ crazy, it should be possible (at least in principle) to order tokens
such that the constraint of "no backward startOffsets" holds, no? I think, as
[~dsmiley] hints at, where this could (perhaps?) become impossible is if such
ordering would conflict with ordering based on positions.
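One way to see where the two orderings can conflict is to sort by position first and then check whether startOffsets still go backwards. The sketch below uses the token layout from the issue description quoted further down; the character offsets are computed by hand from the example sentence, and the {{Token}} type and function name are hypothetical, not Lucene classes.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int
    start: int  # startOffset
    end: int    # endOffset


def can_order_without_backward_offsets(tokens: List[Token]) -> bool:
    """True if some ordering has non-decreasing position AND non-decreasing startOffset.

    Sorting by (position, startOffset) is the most favourable ordering that still
    respects positions, so if startOffset goes backwards here, it does in every one.
    """
    ordered = sorted(tokens, key=lambda t: (t.position, t.start))
    return all(a.start <= b.start for a, b in zip(ordered, ordered[1:]))


# "Organic light-emitting-diode glows", with the compound also emitted at position 1.
BASE_TOKENS = [
    Token("organic", 0, 0, 7),
    Token("light", 1, 8, 13),
    Token("light-emitting-diode", 1, 8, 28),
    Token("emitting", 2, 14, 22),
    Token("diode", 3, 23, 28),
    Token("glows", 4, 29, 34),
]
# The issue's layout adds a second copy of the compound at position 3, same offsets.
ISSUE_TOKENS = BASE_TOKENS + [Token("light-emitting-diode", 3, 8, 28)]

print(can_order_without_backward_offsets(BASE_TOKENS))   # True
print(can_order_without_backward_offsets(ISSUE_TOKENS))  # False
```

The second copy of the compound sits at position 3 but starts at offset 8, behind 'emitting' at position 2 (offset 14), so no position-respecting order avoids a backward startOffset.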
To my mind (echoing [~simonw]'s "have clear bounds of what an offset can be and
in what direction it can grow"): the main benefit of formalizing/enforcing the
order of tokens/offsets in a tokenStream is so that consumers (including
indirect consumers, e.g. via postings at query-time) know what to expect. Even
an index that's not strictly-speaking "corrupt" can be made more
useful/efficient if types of order that _can_ be enforced are enforced.
Two main questions occur to me:
# Are there use cases that truly cannot be supported (even in principle, never
mind with the current state of analysis components) with strict ordering
constraints based on token offsets, positions, and maybe positionLengths?
# Is there enough existing functionality that people have built around the
historic _lack_ of constraints that the horse has left the barn, and the
ability to toggle this behavior off should be provided, absent some practical
compulsion otherwise (e.g., actual index incompatibility/corruption, as opposed
to simply sanity-checking input)?
Perhaps a bit off-topic, but ideally I could see:
# indexed positionLength
# strict ordering of tokens by increasing position (this is the case by
definition, I think?), and for a given position, order by increasing
positionLength
# offsets compatible with positions such that the above position-based ordering
would also result in ordering of tokens by increasing startOffset (perhaps even
adding the constraint that for a given startOffset, endOffset would never
decrease?)
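A rough sketch of what checking such an ordering might look like, assuming positionLength were available alongside position and offsets. All names here are hypothetical, and the example stream is only one possible way the compound from the issue could be represented with a positionLength of 3.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int
    position_length: int
    start: int  # startOffset
    end: int    # endOffset


def satisfies_proposed_order(tokens: List[Token]) -> bool:
    """Check a stream, in emitted order, against the three points listed above."""
    for prev, cur in zip(tokens, tokens[1:]):
        # Increasing position; within one position, increasing positionLength.
        if (cur.position, cur.position_length) < (prev.position, prev.position_length):
            return False
        # startOffset never decreases; for equal startOffset, endOffset never decreases.
        if (cur.start, cur.end) < (prev.start, prev.end):
            return False
    return True


# The compound appears once, at position 1, spanning three positions.
stream = [
    Token("organic", 0, 1, 0, 7),
    Token("light", 1, 1, 8, 13),
    Token("light-emitting-diode", 1, 3, 8, 28),
    Token("emitting", 2, 1, 14, 22),
    Token("diode", 3, 1, 23, 28),
    Token("glows", 4, 1, 29, 34),
]
print(satisfies_proposed_order(stream))  # True

# Emitting the longer token before the shorter one at the same position fails.
swapped = [stream[0], stream[2], stream[1]] + stream[3:]
print(satisfies_proposed_order(swapped))  # False
```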
> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
> Key: LUCENE-8776
> URL: https://issues.apache.org/jira/browse/LUCENE-8776
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 7.6
> Reporter: Ram Venkat
> Priority: Major
> Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run
> span queries and highlight them properly.
> At index time, light-emitting-diode is split into three words, which
> allows me to search for 'light', 'emitting' and 'diode' individually. The
> three words occupy adjacent positions in the index, because queries for
> 'light' adjacent to 'emitting', or 'light' within two words of 'diode', need
> to match this word. So, the order of words after splitting is: Organic,
> light, emitting, diode, glows.
> But, I also want to search for 'organic' being adjacent to
> 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'.
> The way I solved this was to also generate 'light-emitting-diode' at two
> positions: (a) In the same position as 'light' and (b) in the same position
> as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets
> are obviously the same. This works beautifully in Lucene 5.x in both
> searching and highlighting with span queries.
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go
> backwards" at DefaultIndexingChain:818. This IllegalArgumentException is
> being thrown without any comments on why this check is needed. As I explained
> above, startOffset going backwards is perfectly valid, to deal with word
> splitting and span operations on these specialized use cases. On the other
> hand, it is not clear what value is added by this check and which highlighter
> code is affected by offsets going backwards. This same check is done at
> BaseTokenStreamTestCase:245.
> I see others talk about how this check found bugs in WordDelimiter etc. but
> it also prevents legitimate use cases. Can this check be removed?