Yes, in fact Tokenizer already provides correctOffset(), which just delegates
to CharFilter. We could expand on this by moving correctOffset() up to
TokenStream, and by adding a correct() method so that TokenFilters can append
to the character-offset data structure (two int arrays) and share it across
the analysis chain.
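
To make that concrete, here is a rough sketch of what the TokenStream
additions might look like. This is just my guess at shape, not settled API:
the correct() recording method and the shared offsetMap field are
hypothetical; only CharFilter.correctOffset exists today.

    // Hypothetical sketch: correctOffset() hoisted from Tokenizer/CharFilter
    // up to TokenStream, paired with a recording method for TokenFilters.
    public abstract class TokenStream /* extends AttributeSource ... */ {

      // Assumed shared correction map; see the offset-map sketch below.
      protected OffsetMap offsetMap;

      // Map an offset in this stream's view of the text back to an
      // offset in the original input, via the shared correction map.
      public int correctOffset(int currentOff) {
        return offsetMap.correctOffset(currentOff);
      }

      // Record that, from position 'off' onward, original offsets differ
      // from this stream's offsets by 'cumulativeDiff'. TokenFilters that
      // insert or remove characters would call this.
      protected void correct(int off, int cumulativeDiff) {
        offsetMap.addCorrection(off, cumulativeDiff);
      }
    }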

Implementation-wise, this could continue to delegate to CharFilter, I guess,
but I think it would be better to add a character-offset-map abstraction
that wraps the two int arrays and exposes the correct/correctOffset
methods to both TokenStream and CharFilter.
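
Something like this, say. A minimal sketch: the class name OffsetMap and its
method names are made up, but the two-parallel-arrays-plus-binary-search
shape follows what BaseCharFilter already does internally:

    import java.util.Arrays;

    // Shared character-offset map: parallel int arrays recording, for each
    // edit point, the cumulative difference between filtered and original
    // offsets from that point on.
    final class OffsetMap {
      private int[] offsets = new int[16];  // positions in the filtered text
      private int[] diffs = new int[16];    // cumulative diff at/after each position
      private int size = 0;

      // Record that original offsets are 'cumulativeDiff' greater than
      // filtered offsets for positions >= 'off'.
      void addCorrection(int off, int cumulativeDiff) {
        if (size == offsets.length) {
          offsets = Arrays.copyOf(offsets, size * 2);
          diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = off;
        diffs[size] = cumulativeDiff;
        size++;
      }

      // Map a filtered-text offset back to the original text: binary-search
      // for the rightmost recorded edit point at or before currentOff.
      int correctOffset(int currentOff) {
        int lo = 0, hi = size - 1, found = -1;
        while (lo <= hi) {
          int mid = (lo + hi) >>> 1;
          if (offsets[mid] <= currentOff) {
            found = mid;
            lo = mid + 1;
          } else {
            hi = mid - 1;
          }
        }
        return found < 0 ? currentOff : currentOff + diffs[found];
      }
    }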

This would let us preserve correct offsets in the face of manipulations
like replacing ellipses, expanding ligatures (AE, OE), substituting "tm"
for trademark symbols, and the like. Roughly speaking, we could then
maintain the invariant that correctOffset(OffsetAttribute.startOffset()) +
CharTermAttribute.length() == correctOffset(OffsetAttribute.endOffset()),
and it would enable token-splitting with correct offsets.
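
For example, a filter doing the trademark substitution might look like the
following. This is purely hypothetical code: TrademarkExpandFilter is not a
real filter, correct() is the proposed hook from above rather than existing
API, and the exact coordinate conventions would get pinned down in the PoC:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    // Replaces a trademark symbol with "tm" inside the term text, and
    // records the one-character length change so that correctOffset()
    // still maps back into the original input.
    public final class TrademarkExpandFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

      public TrademarkExpandFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        for (int i = 0; i < termAtt.length(); i++) {
          if (termAtt.charAt(i) == '\u2122') {  // the trademark sign
            // Splice "tm" in place of the one-character symbol...
            String expanded = termAtt.subSequence(0, i) + "tm"
                + termAtt.subSequence(i + 1, termAtt.length());
            termAtt.setEmpty().append(expanded);
            // ...and record that, past this point, our offsets now run one
            // ahead of the original text (hence the -1 diff). correct()
            // here is the proposed hook, not existing API.
            correct(offsetAtt.startOffset() + i + 2, -1);
            break;
          }
        }
        return true;
      }
    }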

I can work up a proof of concept; I don't think it would be too
API-intrusive or change performance in a significant way. Only
TokenFilters that actually care about this (i.e., those that insert or
remove characters, or split tokens) would need to change; others would
continue to work as-is.
