[
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604599#comment-14604599
]
Michael McCandless commented on LUCENE-6595:
--------------------------------------------
I think the API change here is necessary, but maybe we can minimize it?
E.g., can we fix the existing BaseCharFilter.addOffCorrectMap method to forward
to the new one that now takes an inputOffset? And can it just pass {{off}} as
the inputOffset (instead of filling with 0)?
I think we may not need the new BaseCharFilter.correctEnd method. We do
need Tokenizer.correctEndOffset, but can we just implement it as LUCENE-5734
proposed ({{correctOffset(endOffset-1)+1}})?
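A rough, self-contained sketch of that {{correctOffset(endOffset-1)+1}} idea, applied straight to a MappingCharFilter over the "(F31)" example from the issue below. The {{correctEndOffset}} helper here is only illustrative (it is not the patch's API; in the patch it would live on Tokenizer), and this assumes the 5.x analysis module:
{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class CorrectEndOffsetSketch {

  // LUCENE-5734's idea: correct the offset of the token's last character,
  // then point just past it.
  static int correctEndOffset(CharFilter filter, int endOffset) {
    return filter.correctOffset(endOffset - 1) + 1;
  }

  public static void main(String[] args) throws Exception {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("(", "");
    builder.add(")", "");
    CharFilter filter =
        new MappingCharFilter(builder.build(), new StringReader("(F31)"));

    // Drain the filter so all of its offset corrections get recorded.
    char[] buffer = new char[64];
    while (filter.read(buffer, 0, buffer.length) != -1) {
    }

    // The token F31 spans output offsets [0, 3).
    System.out.println(filter.correctOffset(0));      // 1: corrected start
    System.out.println(filter.correctOffset(3));      // 5: corrected end today
    System.out.println(correctEndOffset(filter, 3));  // 4: end with the proposed formula
  }
}
{code}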
> CharFilter offsets correction is wonky
> --------------------------------------
>
> Key: LUCENE-6595
> URL: https://issues.apache.org/jira/browse/LUCENE-6595
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Attachments: LUCENE-6595.patch, LUCENE-6595.patch
>
>
> Spinoff from this original Elasticsearch issue:
> https://github.com/elastic/elasticsearch/issues/11726
> If I make a MappingCharFilter with these mappings:
> {noformat}
> ( ->
> ) ->
> {noformat}
> i.e., just erase left and right paren. Tokenizing the string "(F31)"
> with e.g. WhitespaceTokenizer then produces a single token F31, with
> start offset 1 (good). But for its end offset I would expect/want 4,
> whereas it produces 5 today.
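> A minimal reproduction, roughly (against the 5.x analysis module; the
> exact boilerplate may differ):
> {code:java}
> import java.io.StringReader;
>
> import org.apache.lucene.analysis.charfilter.MappingCharFilter;
> import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
> import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>
> public class ParenOffsetsRepro {
>   public static void main(String[] args) throws Exception {
>     // Erase '(' and ')' before tokenizing.
>     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>     builder.add("(", "");
>     builder.add(")", "");
>
>     WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
>     tokenizer.setReader(
>         new MappingCharFilter(builder.build(), new StringReader("(F31)")));
>     CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
>     OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
>
>     tokenizer.reset();
>     while (tokenizer.incrementToken()) {
>       // Prints "F31 1 5" today; 4 is the end offset I would expect.
>       System.out.println(
>           term + " " + offset.startOffset() + " " + offset.endOffset());
>     }
>     tokenizer.end();
>     tokenizer.close();
>   }
> }
> {code}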
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
> Output offset: 0 1 2 3
> Input offset: 1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1). It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above arrays, which creates startOffset=1 (good) and
> endOffset=5 (bad).
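> In toy form (ignoring the compact encoding), correcting an offset is
> just a lookup in that table, clamped at its end:
> {code:java}
> // Toy model of the table above for "(F31)" with '(' and ')' erased; the
> // real BaseCharFilter stores cumulative diffs rather than this array.
> public class OffsetCorrectionToy {
>   // index = output offset, value = corrected input offset
>   static final int[] INPUT_OFFSET = {1, 2, 3, 5};
>
>   static int correct(int outputOffset) {
>     // Output offsets past the end of the table keep the last correction.
>     return INPUT_OFFSET[Math.min(outputOffset, INPUT_OFFSET.length - 1)];
>   }
>
>   public static void main(String[] args) {
>     System.out.println(correct(0)); // 1: startOffset of F31 (good)
>     System.out.println(correct(3)); // 5: endOffset of F31 (4 is what I want)
>   }
> }
> {code}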
> At first, to fix this, I thought it was an "off-by-1" and that when
> correcting the endOffset we should really return
> 1+correct(outputEndOffset-1), which would give the correct value (4)
> here.
> But that's too naive, e.g. here's another example:
> {noformat}
> cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4)
> but if we do this "off-by-1" fix for endOffset, we would get the wrong
> endOffset (2).
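> The correction table for this mapping would (I believe) look like:
> {noformat}
> Output offset: 0 1 2
> Input offset: 0 1 4
> {noformat}
> so correcting the end offset directly gives correct(2) = 4 (right),
> while the "off-by-1" variant gives correct(2-1)+1 = 1+1 = 2 (wrong).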
> I'm not sure what to do here...