[
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604419#comment-14604419
]
Michael McCandless commented on LUCENE-6595:
--------------------------------------------
bq. And do you agree this issue is the same as LUCENE-5734?
This looks like the same issue to me, although since HTMLStripCharFilter
"knows" it's replacing HTML entities (I think?), it could be smarter about
correcting offsets than e.g. MappingCharFilter, which has to stay
generic/agnostic about what exactly it's remapping.
My first idea was the same one proposed on LUCENE-5734: add a new
correctEndOffset method that defaults to {{correctOffset(endOffset-1)+1}}, but
that "fails" the cccc -> cc case.
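Concretely, that default might look something like this (just a sketch: {{correctEndOffset}} is not an existing method, and this is only the naive version):
{noformat}
// Hypothetical addition to CharFilter, NOT an existing API:
// map the token's last character back to the input, then add 1
// to make the result an exclusive end offset again.
public int correctEndOffset(int endOffset) {
  return correctOffset(endOffset - 1) + 1;
}
{noformat}
For the {{(F31)}} example below that gives correctOffset(2)+1 = 3+1 = 4, as wanted, but for {{cccc -> cc}} it gives correctOffset(1)+1 = 1+1 = 2 instead of 4.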
[~caomanhdat]'s approach here is to store another int per correction (the
input offset where the correction first applied), which is a neat solution: it
seems to solve my two examples, and I think it would solve LUCENE-5734 as well?
Any HTML tag or entity that maps to the empty string (e.g. <em>, </em>, <b>,
etc., I think?) would not be included within the output token's
start/endOffset, unless it was "inside" a token.
> CharFilter offsets correction is wonky
> --------------------------------------
>
> Key: LUCENE-6595
> URL: https://issues.apache.org/jira/browse/LUCENE-6595
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Attachments: LUCENE-6595.patch, LUCENE-6595.patch
>
>
> Spinoff from this original Elasticsearch issue:
> https://github.com/elastic/elasticsearch/issues/11726
> If I make a MappingCharFilter with these mappings:
> {noformat}
> ( ->
> ) ->
> {noformat}
> i.e., just erase the left and right parens. Tokenizing the string
> "(F31)" with e.g. WhitespaceTokenizer then produces a single token F31
> with start offset 1 (good).
> But for its end offset I would expect/want 4, whereas it produces 5
> today.
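> Here's a self-contained sketch of the repro (class and variable names are
> mine; assumes the Lucene 5.x analysis APIs in lucene-analyzers-common):
> {noformat}
> import java.io.StringReader;
>
> import org.apache.lucene.analysis.CharFilter;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.charfilter.MappingCharFilter;
> import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
> import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>
> public class MappingCharFilterOffsetsDemo {
>   public static void main(String[] args) throws Exception {
>     // erase left and right paren
>     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>     builder.add("(", "");
>     builder.add(")", "");
>
>     CharFilter charFilter = new MappingCharFilter(builder.build(), new StringReader("(F31)"));
>
>     Tokenizer tokenizer = new WhitespaceTokenizer();
>     tokenizer.setReader(charFilter);
>     CharTermAttribute termAtt = tokenizer.addAttribute(CharTermAttribute.class);
>     OffsetAttribute offsetAtt = tokenizer.addAttribute(OffsetAttribute.class);
>
>     tokenizer.reset();
>     while (tokenizer.incrementToken()) {
>       // prints: F31 1 5  (startOffset 1 is good, but I'd expect/want endOffset 4)
>       System.out.println(termAtt + " " + offsetAtt.startOffset() + " " + offsetAtt.endOffset());
>     }
>     tokenizer.end();
>     tokenizer.close();
>   }
> }
> {noformat}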
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
> Output offset: 0 1 2 3
> Input offset:  1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1). It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above arrays, which yields startOffset=1 (good) and
> endOffset=5 (bad).
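> You can see the same mapping by asking the CharFilter directly (using the
> {{charFilter}} from the sketch above, after the input has been fully read):
> {noformat}
> charFilter.correctOffset(0);   // returns 1 (good)
> charFilter.correctOffset(3);   // returns 5 (bad; I'd expect/want 4)
> {noformat}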
> At first, to fix this, I thought this was an "off-by-1": when
> correcting the endOffset we really should return
> 1+correct(outputEndOffset-1), which would give the correct value (4)
> here.
> But that's too naive, e.g. here's another example:
> {noformat}
> cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4),
> but with this "off-by-1" fix for endOffset we would get the wrong
> endOffset (2).
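> Writing out the same conceptual array for this mapping (my reconstruction of
> what the CharFilter records) shows why the two examples pull in opposite
> directions:
> {noformat}
> Output offset: 0 1 2
> Input offset:  0 1 4
> {noformat}
> The uncorrected token cc has endOffset=2; correct(2)=4 is already right,
> while the proposed 1+correct(2-1) = 1+1 = 2 is wrong.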
> I'm not sure what to do here...