Re: Stacked tokens of differing character length - what does it actually break?

David Smiley Mon, 23 Sep 2024 13:31:07 -0700

IMO the offset ordering constraint should have been a SHOULD not a MUST.
I'm not aware of any actual impact of non-ascending ordering.  Where I
work, we have a fork of Lucene to modify the one-liner enforcement of the
rule.  We've noticed no ill effects.  It has been like this for years, back
into the Lucene 4.x days, maybe even before.  I think this constraint
should be relaxed.



On Thu, Aug 8, 2024 at 7:52 AM Gus Heck <gus.h...@gmail.com> wrote:

> Pardon my vagueness in this question, it's client technology that I can't
> reveal too many details about...
>
> In the docs I see:
>
>    - Tokens that have the same start position must have the same start
>    offset.
>    - Tokens that have the same end position (taking into account the
>    position length) *must have the same end offset*.
>
> If I wanted to consume a stream of information about text that might
> contain several "tokens" of interest (or at least something we want to
> treat as a token) that reference the same start character but span
> differing numbers of characters, what *actually* goes wrong if the end
> offsets differ?
>
> The incoming data stream won't identify (or even have a concept of) the
> additional positions because it does not (for example) break on spaces...
> So to apply variable position lengths I'd have to sub-analyze the matched
> characters (which I do have access to). I'd rather not waste either
> development cycles or clock cycles doing that.
>
> It's of course perfectly fine (ideal and intended actually) for
> highlighting to differ depending on which "token" was matched. Also
> overlapping highlights would not bother me. I'm also not particularly
> worried about any effects on span or phrase queries since I don't
> anticipate that usage. Start offsets can be enforced as non-decreasing with
> no problem, and the start position requirement above is possible though
> there are some cases where it might be more convenient to let the start
> offset vary too.
>
> I did a test of this some time ago with an earlier version of Lucene and
> nothing seemed to go wrong that I could see, but I'm interested in why the
> bullets above are documented as a MUST, and what subtle thing beyond
> highlighting and spans I might be missing, so I can decide whether I'm
> willing to tolerate the side effect if it doesn't actually blow something
> up.
>
> I guess to boil it down, are those "must" statements really truly a "must"
> or just a "really should for best/normal results"?
>
> -Gus
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)
>
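
For readers following along, the two quoted rules (plus the non-decreasing start offset mentioned above) can be written as a small check over a token stream. This is only a sketch in plain Python, not Lucene code: `Token` and `check_stream` are hypothetical names, and starting the position counter at -1 so that the first increment lands on position 0 is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a token's offset and position attributes."""
    term: str
    start_offset: int
    end_offset: int
    pos_inc: int = 1  # position increment; 0 stacks on the previous token
    pos_len: int = 1  # position length; how many positions the token spans


def check_stream(tokens):
    """Return a list of human-readable violations of the quoted rules."""
    problems = []
    start_offset_at = {}  # start position -> first start offset seen there
    end_offset_at = {}    # end position -> first end offset seen there
    pos = -1              # assumption: first increment moves to position 0
    last_start = 0
    for tok in tokens:
        pos += tok.pos_inc
        end_pos = pos + tok.pos_len
        # Ordering rule: start offsets must not go backwards.
        if tok.start_offset < last_start:
            problems.append(f"{tok.term}: start offset goes backwards")
        last_start = max(last_start, tok.start_offset)
        # Rule 1: same start position implies same start offset.
        if start_offset_at.setdefault(pos, tok.start_offset) != tok.start_offset:
            problems.append(f"{tok.term}: start offset differs at position {pos}")
        # Rule 2: same end position implies same end offset.
        if end_offset_at.setdefault(end_pos, tok.end_offset) != tok.end_offset:
            problems.append(
                f"{tok.term}: end offset differs at end position {end_pos}"
            )
    return problems


# Two stacked tokens, same start, different character lengths, both with
# position length 1: this is the case the question asks about.
stacked = [
    Token("new york", 0, 8),
    Token("new york city", 0, 13, pos_inc=0),
]

# The same text with the longer token given a position length of 3, which
# requires knowing about the intermediate positions.
graph = [
    Token("new", 0, 3),
    Token("new york city", 0, 13, pos_inc=0, pos_len=3),
    Token("york", 4, 8),
    Token("city", 9, 13),
]

print(check_stream(stacked))
print(check_stream(graph))
```

The first stream trips the end-offset rule because both tokens end at the same position but at different characters. The second passes, at the cost of the sub-analysis the question hopes to avoid.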
