IMO the offset ordering constraint should have been a SHOULD, not a MUST. I'm not aware of any actual impact of non-ascending offsets. Where I work, we maintain a fork of Lucene that modifies the one-line enforcement of the rule, and we've noticed no ill effects. It has been like this for years, going back to the Lucene 4.x days, maybe even earlier. I think this constraint should be relaxed.
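
To make the discussion concrete, here is a rough sketch of what the two bullets quoted below, plus the non-decreasing start offset rule, amount to as a check over a token stream. It is plain Java and does not use Lucene: `OffsetRuleCheck`, `Tok`, and `violations` are hypothetical names made up for illustration, and the position bookkeeping (start at -1, add each position increment) is my assumption about the usual convention, not a copy of the real enforcement code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone checker; none of these names are Lucene APIs.
final class OffsetRuleCheck {

    // One token as a plain value: position increment, position length, character offsets.
    record Tok(String term, int posInc, int posLen, int startOffset, int endOffset) {}

    // Returns one message per documented rule that the stream breaks.
    static List<String> violations(List<Tok> stream) {
        List<String> problems = new ArrayList<>();
        Map<Integer, Integer> startOffsetAt = new HashMap<>(); // start position -> start offset
        Map<Integer, Integer> endOffsetAt = new HashMap<>();   // end position -> end offset
        int pos = -1;            // assumed convention: first increment of 1 gives position 0
        int lastStartOffset = 0;

        for (Tok t : stream) {
            pos += t.posInc();
            int endPos = pos + t.posLen(); // end position takes position length into account

            // Ordering rule: start offsets must not go backwards.
            if (t.startOffset() < lastStartOffset) {
                problems.add(t.term() + ": start offset goes backwards");
            }
            lastStartOffset = t.startOffset();

            // Bullet 1: same start position implies same start offset.
            Integer seenStart = startOffsetAt.putIfAbsent(pos, t.startOffset());
            if (seenStart != null && seenStart != t.startOffset()) {
                problems.add(t.term() + ": same start position, different start offset");
            }

            // Bullet 2: same end position implies same end offset.
            Integer seenEnd = endOffsetAt.putIfAbsent(endPos, t.endOffset());
            if (seenEnd != null && seenEnd != t.endOffset()) {
                problems.add(t.term() + ": same end position, different end offset");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Two tokens starting at the same character but spanning different lengths,
        // both with position length 1 because the source has no notion of extra positions.
        List<Tok> stream = List.of(
            new Tok("new", 1, 1, 0, 3),
            new Tok("new york", 0, 1, 0, 8));
        violations(stream).forEach(System.out::println);
    }
}
```

In this sketch the stream in `main` trips only the second bullet: both tokens end at the same position but at different offsets. Giving the longer token a position length of 2 (and emitting a token for the second word) would satisfy the check, which is exactly the sub-analysis Gus would rather avoid.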

On Thu, Aug 8, 2024 at 7:52 AM Gus Heck <gus.h...@gmail.com> wrote:

> Pardon my vagueness in this question, it's client technology that I can't
> reveal too many details about...
>
> In the docs I see:
>
>    - Tokens that have the same start position must have the same start
>    offset.
>    - Tokens that have the same end position (taking into account the
>    position length) *must have the same end offset*.
>
> If I wanted to consume a stream of information about text that might
> contain several "tokens" of interest (or at least something we want to
> treat as a token) that reference the same start character but span
> differing numbers of characters, what *actually* goes wrong if the end
> offsets differ?
>
> The incoming data stream won't identify (or even have a concept of) the
> additional positions because it does not (for example) break on spaces...
> So to apply variable position lengths I'd have to sub-analyze the matched
> characters (which I do have access to). I'd rather not waste either
> development cycles or clock cycles doing that.
>
> It's of course perfectly fine (ideal and intended actually) for
> highlighting to differ depending on which "token" was matched. Also
> overlapping highlights would not bother me. I'm also not particularly
> worried about any effects on span or phrase queries since I don't
> anticipate that usage. Start offsets can be enforced as non-decreasing with
> no problem, and the start position requirement above is possible though
> there are some cases where it might be more convenient to let the start
> offset vary too.
>
> I did a test of this some time ago with an earlier version of Lucene and
> nothing seemed to go wrong that I could see, but I'm interested in why the
> bullets above are documented as a MUST, and what subtle thing beyond
> highlighting and spans I might be missing. (and decide if I might be
> willing to tolerate the side effect if it doesn't actually blow something
> up)
>
> I guess to boil it down, are those "must" statements really truly a "must"
> or just a "really should for best/normal results"?
>
> -Gus
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)