Pardon my vagueness in this question; it involves client technology that I can't reveal too many details about.
In the docs I see:

- Tokens that have the same start position must have the same start offset.
- Tokens that have the same end position (taking into account the position length) *must have the same end offset*.

If I wanted to consume a stream of information about text that might contain several "tokens" of interest (or at least something we want to treat as a token) that reference the same start character but span differing numbers of characters, what *actually* goes wrong if the end offsets differ?

The incoming data stream won't identify (or even have a concept of) the additional positions because it does not (for example) break on spaces. So to apply variable position lengths I'd have to sub-analyze the matched characters (which I do have access to). I'd rather not waste either development cycles or clock cycles doing that.

It's of course perfectly fine (ideal and intended, actually) for highlighting to differ depending on which "token" was matched. Overlapping highlights would not bother me either. I'm also not particularly worried about any effects on span or phrase queries, since I don't anticipate that usage.

Start offsets can be enforced as non-decreasing with no problem, and the start position requirement above can be met, though there are some cases where it might be more convenient to let the start offset vary too.

I did a test of this some time ago with an earlier version of Lucene and nothing seemed to go wrong that I could see, but I'm interested in why the bullets above are documented as a MUST, and what subtle thing beyond highlighting and spans I might be missing. Then I can decide whether I'm willing to tolerate the side effect, if it doesn't actually blow something up.

I guess to boil it down: are those "must" statements really, truly a "must", or just a "really should for best/normal results"?

-Gus

--
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)