I've run into some difficulties with offsets in some TokenFilters I've been
writing, and I wonder if anyone can shed any light. Because characters may
be inserted or removed by prior filters (e.g., ICUFoldingFilter does this with
ellipses), and there is no offset-correcting data structure available to
TokenFilters (as there is in CharFilter), there doesn't seem to be any
reliable way to calculate the offset at a point interior to a token, which
means that essentially the only reasonable thing to do with OffsetAttribute
is to preserve the offsets from the input. This means that filters that
split their tokens (like WordDelimiterGraphFilter) have no reliable way of
mapping their split tokens' offsets. One can try, but it seems to inevitably
require some arbitrary "fixup" stage in order to guarantee that the offsets
are nondecreasing and properly bounded by the original text length.
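
To make that concrete, here is a toy sketch (not WordDelimiterGraphFilter
itself; the class name and split rule are made up) of a splitting filter that
takes the conservative route and just stamps the parent token's offsets onto
every part, since it cannot know how upstream filters changed the term's
length relative to the original text:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    // Toy example only: ignores position increments and graph attributes.
    final class HyphenSplitFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

      private String[] pendingParts = new String[0];
      private int partIndex;
      private int parentStart, parentEnd;

      HyphenSplitFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (partIndex < pendingParts.length) {
          emitPart();
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        pendingParts = termAtt.toString().split("-");
        if (pendingParts.length == 0) {
          // all-delimiter term: pass the original token through unchanged
          return true;
        }
        partIndex = 0;
        parentStart = offsetAtt.startOffset();
        parentEnd = offsetAtt.endOffset();
        emitPart();
        return true;
      }

      private void emitPart() {
        termAtt.setEmpty().append(pendingParts[partIndex++]);
        // Computing something like parentStart + charsConsumedSoFar would be
        // wrong whenever an upstream filter has changed the token's length,
        // so the safest thing is to reuse the parent token's offsets.
        offsetAtt.setOffset(parentStart, parentEnd);
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pendingParts = new String[0];
        partIndex = 0;
      }
    }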

If this analysis is correct, it seems one should really never call
OffsetAttribute.setOffset at all, which makes it a trappy kind of
method to provide. (Hmm, now I see this comment in OffsetAttributeImpl
suggesting making the method call-once.) If that really is the case, I
think some assertion, deprecation, or other API protection would be useful
so the policy is clear.
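
For illustration, a call-once guard could look roughly like the following;
the class name, exception choices, and flag are all made up here, not the
existing OffsetAttributeImpl code:

    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.AttributeImpl;
    import org.apache.lucene.util.AttributeReflector;

    // Illustrative only -- not the real OffsetAttributeImpl.
    public class OneShotOffsetAttributeImpl extends AttributeImpl implements OffsetAttribute {
      private int startOffset;
      private int endOffset;
      private boolean offsetsSet;

      @Override
      public void setOffset(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
          throw new IllegalArgumentException(
              "startOffset must be non-negative and <= endOffset; got "
                  + startOffset + ", " + endOffset);
        }
        // The "call-once" policy: a second call for the same token trips here.
        // clearAttributes() resets the flag when the stream moves on.
        if (offsetsSet) {
          throw new IllegalStateException("offsets were already set for this token");
        }
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.offsetsSet = true;
      }

      @Override public int startOffset() { return startOffset; }
      @Override public int endOffset() { return endOffset; }

      @Override
      public void clear() {
        startOffset = endOffset = 0;
        offsetsSet = false;
      }

      @Override
      public void copyTo(AttributeImpl target) {
        // Copy fields directly so restoring captured state does not trip the guard.
        OneShotOffsetAttributeImpl other = (OneShotOffsetAttributeImpl) target;
        other.startOffset = startOffset;
        other.endOffset = endOffset;
        other.offsetsSet = offsetsSet;
      }

      @Override
      public void reflectWith(AttributeReflector reflector) {
        reflector.reflect(OffsetAttribute.class, "startOffset", startOffset);
        reflector.reflect(OffsetAttribute.class, "endOffset", endOffset);
      }
    }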

Alternatively, do we want to consider providing a "fixup" API as we have
for CharFilter? OffsetAttribute, e.g., could do the fixup if we provide an
API for setting offset deltas. This would make more precise highlighting
possible in these cases, at least. I'm not sure what other use cases folks
have come up with for offsets?
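
As a sketch of what the per-token fixup data could carry (class and method
names are hypothetical; the lookup just mirrors the cumulative-delta idea
behind CharFilter-style offset correction):

    import java.util.Arrays;

    // Hypothetical sketch of per-token "offset delta" data, not an existing API.
    final class OffsetDeltaMap {
      // Positions within the current term, in increasing order, and the
      // cumulative original-text shift that applies from that position on.
      private int[] positions = new int[4];
      private int[] cumulativeDeltas = new int[4];
      private int size;

      // Called by a filter that inserts or removes characters (positions must
      // be added in increasing order for the binary search below to work).
      void addDelta(int position, int cumulativeDelta) {
        if (size == positions.length) {
          positions = Arrays.copyOf(positions, 2 * size);
          cumulativeDeltas = Arrays.copyOf(cumulativeDeltas, 2 * size);
        }
        positions[size] = position;
        cumulativeDeltas[size] = cumulativeDelta;
        size++;
      }

      // Map an offset interior to the term back to an offset in the original text.
      int correct(int offsetWithinTerm) {
        int idx = Arrays.binarySearch(positions, 0, size, offsetWithinTerm);
        if (idx < 0) {
          idx = -idx - 2; // index of the last recorded position <= offsetWithinTerm
        }
        return idx < 0 ? offsetWithinTerm : offsetWithinTerm + cumulativeDeltas[idx];
      }
    }

A splitting filter could then do something like
offsetAtt.setOffset(parentStart + deltas.correct(partStart),
parentStart + deltas.correct(partEnd)) instead of falling back to the
parent token's offsets, which would make highlighting the split parts precise.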

-Mike
