How would a fixup API work? Would we try to provide correctOffset throughout the full analysis chain?
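
To make the question concrete, here is a rough sketch of the kind of offset-delta bookkeeping I imagine a fixup would need. It is loosely modeled on how I understand the CharFilter correction map to work, but everything in it (the OffsetDeltaMap class, addCorrection, correct, resolve) is hypothetical and is not existing Lucene API. The example assumes a filter that folds a single ellipsis character (U+2026) into three dots, so the term text grows by two characters relative to the original.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch, not a Lucene class: maps offsets inside a filtered
 * term back to offsets in the original text using cumulative deltas.
 */
final class OffsetDeltaMap {
    // Positions in the filtered text where the cumulative delta changes.
    private final List<Integer> positions = new ArrayList<>();
    // Delta to add to a filtered offset at or after the matching position.
    private final List<Integer> cumulativeDiffs = new ArrayList<>();

    /** Records that from filtered position pos onward, original = filtered + cumulativeDiff. */
    void addCorrection(int pos, int cumulativeDiff) {
        if (!positions.isEmpty() && pos <= positions.get(positions.size() - 1)) {
            throw new IllegalArgumentException("positions must be strictly increasing");
        }
        positions.add(pos);
        cumulativeDiffs.add(cumulativeDiff);
    }

    /** Maps a filtered offset to an original offset (both relative to the token start). */
    int correct(int offset) {
        int lo = 0, hi = positions.size() - 1, found = -1;
        // Binary search for the last recorded position <= offset.
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (positions.get(mid) <= offset) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found < 0 ? offset : offset + cumulativeDiffs.get(found);
    }

    /**
     * Resolves an interior offset of a token to an absolute original offset,
     * clamped to the token's original [tokenStart, tokenEnd] range so the
     * result stays bounded by the incoming offsets.
     */
    int resolve(int tokenStart, int interiorOffset, int tokenEnd) {
        int candidate = tokenStart + correct(interiorOffset);
        return Math.max(tokenStart, Math.min(tokenEnd, candidate));
    }

    public static void main(String[] args) {
        // Original term: 'a', U+2026, 'b' (3 chars). Folded term: "a...b" (5 chars).
        OffsetDeltaMap map = new OffsetDeltaMap();
        map.addCorrection(2, -1); // second dot still maps to the ellipsis
        map.addCorrection(3, -2); // third dot and everything after it
        // Token spans original offsets [10, 13); 'b' sits at folded index 4.
        System.out.println(map.resolve(10, 4, 13)); // prints 12
    }
}
```

A splitting filter could then call something like resolve() for each sub-token boundary instead of guessing, though each filter in the chain would have to compose its deltas with those of the filters before it, which is really what my question is about.
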

Mike McCandless
http://blog.mikemccandless.com

On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov <msoko...@gmail.com> wrote:

> I've run into some difficulties with offsets in some TokenFilters I've been
> writing, and I wonder if anyone can shed any light. Because characters may
> be inserted or removed by prior filters (eg ICUFoldingFilter does this with
> ellipses), and there is no offset-correcting data structure available to
> TokenFilters (as there is in CharFilter), there doesn't seem to be any
> reliable way to calculate the offset at a point interior to a token, which
> means that essentially the only reasonable thing to do with OffsetAttribute
> is to preserve the offsets from the input. This means that filters that
> split their tokens (like WordDelimiterGraphFilter) have no reliable way of
> mapping their split tokens' offsets. One can try, but it seems inevitably
> to require making some arbitrary "fixup" stage in order to guarantee that
> the offsets are nondecreasing and properly bounded by the original text
> length.
>
> If this analysis is correct, it seems one should really never call
> OffsetAttribute.setOffset at all? Which makes it seem like a trappy kind of
> method to provide. (hmm now I see this comment in OffsetAttributeImpl
> suggesting making the method call-once). If that really is the case, I
> think some assertion, deprecation, or other API protection would be useful
> so the policy is clear.
>
> Alternatively, do we want to consider providing a "fixup" API as we have
> for CharFilter? OffsetAttribute, eg, could do the fixup if we provide an
> API for setting offset deltas. This would make more precise highlighting
> possible in these cases, at least. I'm not sure what other use cases folks
> have come up with for offsets?
>
> -Mike