[ https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576353#comment-16576353 ]
Michael McCandless commented on LUCENE-8450:
--------------------------------------------

{quote}Separately I don't like the correctOffset() method that we already have on tokenizer today. maybe it could be in the offsetattributeimpl or similar instead.
{quote}

I like that idea – it always seemed weird that {{correctOffset}} was only available via {{CharFilter}}, whereas {{OffsetAttribute}} is really the more natural place for it.

> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
>                 Key: LUCENE-8450
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8450
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent
> filters cannot perform simple arithmetic to calculate the original
> ("correct") offset of a character in the interior of the token. A similar
> situation exists for Tokenizers, but these can call
> CharFilter.correctOffset() to map offsets back to their original location in
> the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the
> correct portion of a compound token. For example, the German word
> "außerstand" (meaning afaict "unable to do something") will be decompounded
> and match "stand" and "ausser", but as things are today, offsets are always
> set using the start and end of the tokens produced by the Tokenizer, meaning
> that highlighters will match the entire compound.
> I'm proposing to add this method to {{TokenStream}}:
> {{ public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{ int correctOffset(int currentOff);}}
> {{ int uncorrectOffset(int originalOff);}}
>
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from
> the original offset forward to the current "offset space".
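
To make the proposed pair of methods concrete, here is a minimal sketch of what a CharOffsetMap could look like. It is not taken from offsets.patch: the class names OffsetMapSketch and ArrayCharOffsetMap, the sorted-array storage, and the choice of "smallest current offset that corrects to at least the original offset" as the pseudo-inverse are all assumptions made for illustration. The example assumes "außerstand" was rewritten to "ausserstand" before decompounding, so every current offset from 3 onward is one position ahead of the original text.

```java
import java.util.Arrays;

public class OffsetMapSketch {

    /** Hypothetical interface carrying the two methods named in the proposal. */
    interface CharOffsetMap {
        int correctOffset(int currentOff);
        int uncorrectOffset(int originalOff);
    }

    /**
     * Illustrative implementation (an assumption, not the patch): for any
     * current offset >= offsets[i], the original offset is current + diffs[i].
     * offsets must be sorted ascending; diffs are cumulative corrections.
     */
    static final class ArrayCharOffsetMap implements CharOffsetMap {
        private final int[] offsets;
        private final int[] diffs;

        ArrayCharOffsetMap(int[] offsets, int[] diffs) {
            if (offsets.length != diffs.length) {
                throw new IllegalArgumentException("offsets and diffs must have equal length");
            }
            this.offsets = offsets.clone();
            this.diffs = diffs.clone();
        }

        @Override
        public int correctOffset(int currentOff) {
            int idx = Arrays.binarySearch(offsets, currentOff);
            if (idx < 0) {
                // Not found: step back to the last entry at or below currentOff (-1 if none).
                idx = -idx - 2;
            }
            return idx < 0 ? currentOff : currentOff + diffs[idx];
        }

        @Override
        public int uncorrectOffset(int originalOff) {
            // Pseudo-inverse: smallest current offset whose corrected value reaches
            // originalOff. A linear scan keeps the sketch short; it terminates
            // because correctOffset(c) grows with c past the last entry.
            int c = 0;
            while (correctOffset(c) < originalOff) {
                c++;
            }
            return c;
        }
    }

    public static void main(String[] args) {
        // Original: "außerstand"; current: "ausserstand" (ß expanded to "ss").
        CharOffsetMap map = new ArrayCharOffsetMap(new int[] {3}, new int[] {-1});
        // "stand" occupies [6, 11) in the current text.
        System.out.println("stand: [" + map.correctOffset(6) + ", " + map.correctOffset(11) + ")");
    }
}
```

With a map like this, a decompounding filter could report offsets [5, 10) for "stand" instead of the offsets of the whole compound, which is the highlighting case described in the issue.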