I've run into some difficulties with offsets in some TokenFilters I've been writing, and I wonder if anyone can shed any light. Because characters may be inserted or removed by prior filters (eg ICUFoldingFilter does this with ellipses), and there is no offset-correcting data structure available to TokenFilters (as there is in CharFilter), there doesn't seem to be any reliable way to calculate the offset at a point interior to a token. This means that essentially the only reasonable thing to do with OffsetAttribute is to preserve the offsets from the input, and that filters that split their tokens (like WordDelimiterGraphFilter) have no reliable way of mapping their split tokens' offsets. One can try, but it seems inevitably to require some arbitrary "fixup" stage to guarantee that the offsets are nondecreasing and properly bounded by the original text length.
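To make the difficulty concrete, here is a rough sketch (purely illustrative; the class name and the split-on-underscore behavior are made up, and this is not how WordDelimiterGraphFilter actually works) of a splitting filter that tries to derive interior offsets by counting characters from the parent token's startOffset. The clamping at the end is the kind of arbitrary "fixup" I mean:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Illustrative only: splits tokens on '_' and guesses offsets for each piece
// by counting characters from the parent token's startOffset. If an upstream
// filter has inserted or removed characters (as ICUFoldingFilter can), that
// arithmetic is wrong, and the clamp below is the arbitrary "fixup" stage.
public final class NaiveSplittingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

  private char[] pending;             // remaining text of the token being split
  private int pendingUpto;            // next char to emit from pending
  private int parentStart, parentEnd; // offsets of the original token

  public NaiveSplittingFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending == null) {
      if (!input.incrementToken()) {
        return false;
      }
      pending = new char[termAtt.length()];
      System.arraycopy(termAtt.buffer(), 0, pending, 0, termAtt.length());
      pendingUpto = 0;
      parentStart = offsetAtt.startOffset();
      parentEnd = offsetAtt.endOffset();
    } else {
      // emitting another piece of the same parent token
      posIncAtt.setPositionIncrement(1);
    }
    // find the next '_' (or the end of the token)
    int start = pendingUpto;
    int end = start;
    while (end < pending.length && pending[end] != '_') {
      end++;
    }
    termAtt.copyBuffer(pending, start, end - start);

    // Guess offsets by assuming the term text maps 1:1 onto the original
    // characters -- exactly the assumption that upstream filters can break.
    int guessedStart = parentStart + start;
    int guessedEnd = parentStart + end;

    // Arbitrary fixup: clamp so offsets stay nondecreasing and within the
    // parent token's bounds, since we cannot actually correct them.
    guessedStart = Math.min(guessedStart, parentEnd);
    guessedEnd = Math.min(Math.max(guessedEnd, guessedStart), parentEnd);
    offsetAtt.setOffset(guessedStart, guessedEnd);

    pendingUpto = end + 1;
    if (pendingUpto >= pending.length) {
      pending = null; // done with this parent token
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
  }
}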
If this analysis is correct, it seems one should really never call OffsetAttribute.setOffset at all? That makes it seem like a trappy kind of method to provide. (hmm, now I see this comment in OffsetAttributeImpl suggesting making the method call-once.) If that really is the case, I think some assertion, deprecation, or other API protection would be useful so the policy is clear. Alternatively, do we want to consider providing a "fixup" API as we have for CharFilter? OffsetAttribute, eg, could do the fixup if we provided an API for setting offset deltas (a rough sketch of what I have in mind is below). That would at least make more precise highlighting possible in these cases. I'm not sure what other use cases folks have come up with for offsets?

-Mike
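PS - to make "offset deltas" more concrete, here is a purely hypothetical sketch; none of this exists in Lucene today, and the names are made up. The idea is an offset-correction hook analogous to CharFilter.correctOffset(): filters that change a token's character count would register deltas, and anything that needs interior offsets (a highlighter splitting a term, say) would map through the corrections rather than doing raw arithmetic on startOffset/endOffset.

// Hypothetical API sketch, not an existing Lucene interface.
public interface CorrectedOffsetAttribute /* would extend OffsetAttribute */ {

  /**
   * Record that, from position {@code splitPoint} within the current term
   * text onward, the corresponding character in the original input is
   * {@code delta} characters further along (or behind, if negative).
   */
  void addOffsetDelta(int splitPoint, int delta);

  /**
   * Map a position within the current term text back to an offset in the
   * original input, applying any registered deltas.
   */
  int correctOffset(int positionInTerm);
}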