[
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596742#comment-14596742
]
Michael McCandless commented on LUCENE-6595:
--------------------------------------------
Thanks [~caomanhdat], I'll try to understand your proposed change. But some
tests seem to be failing with this patch, e.g.:
{noformat}
[junit4] Suite: org.apache.lucene.analysis.core.TestBugInSomething
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestBugInSomething
-Dtests.method=test -Dtests.seed=FD8C8301DD07CEFD -Dtests.locale=hr
-Dtests.timezone=SystemV/PST8PDT -Dtests.asserts=true
-Dtests.file.encoding=US-ASCII
[junit4] FAILURE 0.00s J2 | TestBugInSomething.test <<<
[junit4] > Throwable #1: java.lang.AssertionError: finalOffset
expected:<16> but was:<20>
[junit4] > at
__randomizedtesting.SeedInfo.seed([FD8C8301DD07CEFD:75D8BCDB73FBA305]:0)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:280)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:295)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:299)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:812)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:674)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:670)
[junit4] > at
org.apache.lucene.analysis.core.TestBugInSomething.test(TestBugInSomething.java:77)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] IGNOR/A 0.01s J2 | TestBugInSomething.testUnicodeShinglesAndNgrams
[junit4] > Assumption #1: 'slow' test group is disabled (@Slow())
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene53): {},
docValues:{}, sim=DefaultSimilarity, locale=hr, timezone=SystemV/PST8PDT
[junit4] 2> NOTE: Linux 3.13.0-46-generic amd64/Oracle Corporation
1.8.0_40 (64-bit)/cpus=8,threads=1,free=370188896,total=519569408
{noformat}
and
{noformat}
[junit4] Suite:
org.apache.lucene.analysis.charfilter.TestHTMLStripCharFilterFactory
[junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=TestHTMLStripCharFilterFactory -Dtests.method=testSingleEscapedTag
-Dtests.seed=FD8C8301DD07CEFD -Dtests.locale=lt_LT
-Dtests.timezone=America/Thule -Dtests.asserts=true
-Dtests.file.encoding=US-ASCII
[junit4] ERROR 0.00s J3 |
TestHTMLStripCharFilterFactory.testSingleEscapedTag <<<
[junit4] > Throwable #1: java.lang.NullPointerException
[junit4] > at
__randomizedtesting.SeedInfo.seed([FD8C8301DD07CEFD:36A72464D080D0F1]:0)
[junit4] > at
org.apache.lucene.analysis.charfilter.BaseCharFilter.correctEnd(BaseCharFilter.java:82)
[junit4] > at
org.apache.lucene.analysis.CharFilter.correctEndOffset(CharFilter.java:93)
[junit4] > at
org.apache.lucene.analysis.Tokenizer.correctEndOffset(Tokenizer.java:84)
[junit4] > at
org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:176)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:177)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:295)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:299)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:303)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:327)
[junit4] > at
org.apache.lucene.analysis.charfilter.TestHTMLStripCharFilterFactory.testSingleEscapedTag(TestHTMLStripCharFilterFactory.java:99)
[junit4] > at java.lang.Thread.run(Thread.java:745)
{noformat}
> CharFilter offsets correction is wonky
> --------------------------------------
>
> Key: LUCENE-6595
> URL: https://issues.apache.org/jira/browse/LUCENE-6595
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Attachments: LUCENE-6595.patch
>
>
> Spinoff from this original Elasticsearch issue:
> https://github.com/elastic/elasticsearch/issues/11726
> If I make a MappingCharFilter with these mappings:
> {noformat}
> ( ->
> ) ->
> {noformat}
> i.e., just erase the left and right parens. Tokenizing the string
> "(F31)" with e.g. WhitespaceTokenizer then produces a single token,
> F31, with start offset 1 (good).
> But for its end offset I would expect/want 4, yet it produces 5
> today.
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
> Output offset: 0 1 2 3
> Input offset: 1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1). It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above arrays, which creates startOffset=1 (good) and
> endOffset=5 (bad).
> At first I thought this was an "off-by-1" and that, when correcting
> the endOffset, we should really return 1+correct(outputEndOffset-1),
> which would give the correct value (4) here.
> But that's too naive, e.g. here's another example:
> {noformat}
> cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4),
> but with this "off-by-1" fix for endOffset we would get the wrong
> endOffset (2).
> I'm not sure what to do here...
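The two examples above can be simulated with a small standalone sketch (no Lucene dependency). The table layout and the correct() helper here are illustrative assumptions, not Lucene's actual compact encoding: the table records, at a few output offsets, the cumulative diff to add back to reach the input offset, and correct() applies the diff of the last change point at or before the given output offset.

```java
// Standalone sketch of CharFilter offset correction as described in the
// issue. NOT Lucene code: the (changeOut, diffs) table and correct() are
// hypothetical stand-ins for the cumulative offset-diff encoding.
public class OffsetCorrectionSketch {

    // Map an output offset back to an input offset: add the diff recorded
    // at the last change point <= outputOffset (0 if none applies yet).
    static int correct(int[] changeOut, int[] diffs, int outputOffset) {
        int diff = 0;
        for (int i = 0; i < changeOut.length && changeOut[i] <= outputOffset; i++) {
            diff = diffs[i];
        }
        return outputOffset + diff;
    }

    public static void main(String[] args) {
        // Example 1: "(F31)" with "(" -> "" and ")" -> "", output "F31".
        // Diff becomes 1 at output offset 0 (after erasing "(") and 2 at
        // output offset 3 (after erasing ")").
        int[] out1 = {0, 3}, diff1 = {1, 2};
        System.out.println(correct(out1, diff1, 0));         // startOffset: 1 (good)
        System.out.println(correct(out1, diff1, 3));         // endOffset:   5 (bad, want 4)
        System.out.println(1 + correct(out1, diff1, 3 - 1)); // off-by-1 fix: 4 (good here)

        // Example 2: "cccc" -> "cc", output "cc".
        // Diff becomes 2 at output offset 2 (after the rule consumes 4
        // input chars and emits 2).
        int[] out2 = {2}, diff2 = {2};
        System.out.println(correct(out2, diff2, 2));         // endOffset:   4 (good today)
        System.out.println(1 + correct(out2, diff2, 2 - 1)); // off-by-1 fix: 2 (bad, want 4)
    }
}
```

This reproduces the dilemma: the same "1 + correct(end - 1)" adjustment that repairs the paren example breaks the cccc example, because in the second case the token's end already sits exactly on a change point.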
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]