Stacked tokens of differing character length - what does it actually break?

Gus Heck Thu, 08 Aug 2024 04:52:19 -0700

Pardon the vagueness of this question; it concerns client technology that
I can't reveal many details about.


In the docs I see:

   - Tokens that have the same start position must have the same start
   offset.
   - Tokens that have the same end position (taking into account the
   position length) *must have the same end offset*.

If I wanted to consume a stream of information about text that might
contain several "tokens" of interest (or at least something we want to
treat as a token) that reference the same start character but span
differing numbers of characters, what *actually* goes wrong if the end
offsets differ?
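
To make the two bullets concrete, here is a minimal sketch in plain Java
that walks a token list and reports where either rule is broken. The Tok
class and the violations() helper are hypothetical names invented for this
illustration; they are not Lucene classes, and the sketch only models the
position/offset bookkeeping under the assumption that a token's end
position is its start position plus its position length.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OffsetRuleCheck {

    // Hypothetical stand-in for one token's attributes (not a Lucene class).
    static final class Tok {
        final String term;
        final int posInc;      // position increment; 0 stacks on the previous token
        final int posLen;      // position length
        final int startOffset; // character offset where the token starts
        final int endOffset;   // character offset where the token ends

        Tok(String term, int posInc, int posLen, int startOffset, int endOffset) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Returns one message per token that disagrees with an earlier token
    // sharing its start position or its end position.
    static List<String> violations(List<Tok> stream) {
        Map<Integer, Integer> startOffsetByPos = new HashMap<>();
        Map<Integer, Integer> endOffsetByEndPos = new HashMap<>();
        List<String> problems = new ArrayList<>();
        int pos = -1;
        for (Tok t : stream) {
            pos += t.posInc;
            int endPos = pos + t.posLen;
            Integer seenStart = startOffsetByPos.putIfAbsent(pos, t.startOffset);
            if (seenStart != null && seenStart != t.startOffset) {
                problems.add("start offset differs at position " + pos + ": " + t.term);
            }
            Integer seenEnd = endOffsetByEndPos.putIfAbsent(endPos, t.endOffset);
            if (seenEnd != null && seenEnd != t.endOffset) {
                problems.add("end offset differs at end position " + endPos + ": " + t.term);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Two stacked tokens, both with position length 1, but different end offsets.
        List<Tok> flat = List.of(
            new Tok("new york", 1, 1, 0, 8),
            new Tok("new", 0, 1, 0, 3));
        System.out.println(violations(flat));

        // Same text, but the longer token spans two positions.
        List<Tok> graph = List.of(
            new Tok("new york", 1, 2, 0, 8),
            new Tok("new", 0, 1, 0, 3),
            new Tok("york", 1, 1, 4, 8));
        System.out.println(violations(graph));
    }
}
```

The first stream is the case asked about here: same start position, same
start offset, same position length, different end offsets. The second shows
what satisfying the rule would require, namely a position length that
reflects how many positions the longer token covers.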

The incoming data stream won't identify (or even have a concept of) the
additional positions because it does not (for example) break on spaces...
So to apply variable position lengths I'd have to sub-analyze the matched
characters (which I do have access to). I'd rather not waste either
development cycles or clock cycles doing that.
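
For a sense of scale, the sub-analysis could be as small as the sketch
below. It assumes, purely for illustration, that positions correspond to
whitespace-separated chunks of the matched characters; the real boundary
rule would be whatever the rest of the analysis chain uses. The class and
method names are hypothetical.

```java
final class PositionLengthSketch {

    // Counts whitespace-separated chunks in the matched text and uses that
    // count as the position length. Assumption: whitespace is the position
    // boundary. Returns at least 1 so that an empty or all-whitespace match
    // still occupies one position.
    static int positionLength(CharSequence matched) {
        int chunks = 0;
        boolean inChunk = false;
        for (int i = 0; i < matched.length(); i++) {
            boolean ws = Character.isWhitespace(matched.charAt(i));
            if (!ws && !inChunk) {
                chunks++; // first character of a new chunk
            }
            inChunk = !ws;
        }
        return Math.max(1, chunks);
    }

    public static void main(String[] args) {
        System.out.println(positionLength("new york"));
        System.out.println(positionLength("new"));
    }
}
```

This is a single pass over characters that are already in hand, so the
clock-cycle cost is likely small; the development cost lies in agreeing on
what counts as a position boundary.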

It's of course perfectly fine (ideal and intended, actually) for
highlighting to differ depending on which "token" was matched. Overlapping
highlights would not bother me either. I'm also not particularly worried
about any effects on span or phrase queries, since I don't anticipate that
usage. Start offsets can be enforced as non-decreasing with no problem, and
the start position requirement above can be met, though there are some
cases where it might be more convenient to let the start offset vary too.

I tested this some time ago with an earlier version of Lucene and nothing
seemed to go wrong that I could see, but I'm interested in why the bullets
above are documented as a MUST, and what subtle thing beyond highlighting
and spans I might be missing, so I can decide whether I'm willing to
tolerate the side effect if it doesn't actually blow something up.

I guess to boil it down, are those "must" statements really truly a "must"
or just a "really should for best/normal results"?

-Gus

-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
