[
https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596742#comment-14596742
]
Michael McCandless commented on LUCENE-6595:
--------------------------------------------
Thanks [~caomanhdat], I'll try to understand your proposed change. But some
tests seem to be failing with this patch, e.g.:
{noformat}
[junit4] Suite: org.apache.lucene.analysis.core.TestBugInSomething
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestBugInSomething
-Dtests.method=test -Dtests.seed=FD8C8301DD07CEFD -Dtests.locale=hr
-Dtests.timezone=SystemV/PST8PDT -Dtests.asserts=true
-Dtests.file.encoding=US-ASCII
[junit4] FAILURE 0.00s J2 | TestBugInSomething.test <<<
[junit4] > Throwable #1: java.lang.AssertionError: finalOffset
expected:<16> but was:<20>
[junit4] > at
__randomizedtesting.SeedInfo.seed([FD8C8301DD07CEFD:75D8BCDB73FBA305]:0)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:280)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:295)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:299)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:812)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:674)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:670)
[junit4] > at
org.apache.lucene.analysis.core.TestBugInSomething.test(TestBugInSomething.java:77)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] IGNOR/A 0.01s J2 | TestBugInSomething.testUnicodeShinglesAndNgrams
[junit4] > Assumption #1: 'slow' test group is disabled (@Slow())
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene53): {},
docValues:{}, sim=DefaultSimilarity, locale=hr, timezone=SystemV/PST8PDT
[junit4] 2> NOTE: Linux 3.13.0-46-generic amd64/Oracle Corporation
1.8.0_40 (64-bit)/cpus=8,threads=1,free=370188896,total=519569408
{noformat}
and
{noformat}
[junit4] Suite:
org.apache.lucene.analysis.charfilter.TestHTMLStripCharFilterFactory
[junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=TestHTMLStripCharFilterFactory -Dtests.method=testSingleEscapedTag
-Dtests.seed=FD8C8301DD07CEFD -Dtests.locale=lt_LT
-Dtests.timezone=America/Thule -Dtests.asserts=true
-Dtests.file.encoding=US-ASCII
[junit4] ERROR 0.00s J3 |
TestHTMLStripCharFilterFactory.testSingleEscapedTag <<<
[junit4] > Throwable #1: java.lang.NullPointerException
[junit4] > at
__randomizedtesting.SeedInfo.seed([FD8C8301DD07CEFD:36A72464D080D0F1]:0)
[junit4] > at
org.apache.lucene.analysis.charfilter.BaseCharFilter.correctEnd(BaseCharFilter.java:82)
[junit4] > at
org.apache.lucene.analysis.CharFilter.correctEndOffset(CharFilter.java:93)
[junit4] > at
org.apache.lucene.analysis.Tokenizer.correctEndOffset(Tokenizer.java:84)
[junit4] > at
org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:176)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:177)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:295)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:299)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:303)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:327)
[junit4] > at
org.apache.lucene.analysis.charfilter.TestHTMLStripCharFilterFactory.testSingleEscapedTag(TestHTMLStripCharFilterFactory.java:99)
[junit4] > at java.lang.Thread.run(Thread.java:745)
{noformat}
> CharFilter offsets correction is wonky
> --------------------------------------
>
> Key: LUCENE-6595
> URL: https://issues.apache.org/jira/browse/LUCENE-6595
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Attachments: LUCENE-6595.patch
>
>
> Spinoff from this original Elasticsearch issue:
> https://github.com/elastic/elasticsearch/issues/11726
> If I make a MappingCharFilter with these mappings:
> {noformat}
> ( ->
> ) ->
> {noformat}
> i.e., just erase the left and right parens. Tokenizing the string
> "(F31)" with e.g. WhitespaceTokenizer then produces a single token,
> F31, with start offset 1 (good).
> But for its end offset I would expect/want 4, yet it produces 5
> today.
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
> Output offset: 0 1 2 3
> Input offset: 1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1). It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above arrays, which creates startOffset=1 (good) and
> endOffset=5 (bad).
> At first I thought this was an "off-by-1" and that, when correcting
> the endOffset, we should really return 1+correct(outputEndOffset-1),
> which would give the correct value (4) here.
> But that's too naive, e.g. here's another example:
> {noformat}
> cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4),
> but with this "off-by-1" fix for endOffset we would get the wrong
> endOffset (2).
> I'm not sure what to do here...
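The two examples above can be simulated with a small standalone sketch (no Lucene dependency). The table layout and the correct() helper here are illustrative assumptions, not Lucene's actual compact encoding: the table records, at a few output offsets, the cumulative diff to add back to reach the input offset, and correct() applies the diff of the last change point at or before the given output offset.

```java
// Standalone sketch of CharFilter offset correction as described in the
// issue. NOT Lucene code: the (changeOut, diffs) table and correct() are
// hypothetical stand-ins for the cumulative offset-diff encoding.
public class OffsetCorrectionSketch {

    // Map an output offset back to an input offset: add the diff recorded
    // at the last change point <= outputOffset (0 if none applies yet).
    static int correct(int[] changeOut, int[] diffs, int outputOffset) {
        int diff = 0;
        for (int i = 0; i < changeOut.length && changeOut[i] <= outputOffset; i++) {
            diff = diffs[i];
        }
        return outputOffset + diff;
    }

    public static void main(String[] args) {
        // Example 1: "(F31)" with "(" -> "" and ")" -> "", output "F31".
        // Diff becomes 1 at output offset 0 (after erasing "(") and 2 at
        // output offset 3 (after erasing ")").
        int[] out1 = {0, 3}, diff1 = {1, 2};
        System.out.println(correct(out1, diff1, 0));         // startOffset: 1 (good)
        System.out.println(correct(out1, diff1, 3));         // endOffset:   5 (bad, want 4)
        System.out.println(1 + correct(out1, diff1, 3 - 1)); // off-by-1 fix: 4 (good here)

        // Example 2: "cccc" -> "cc", output "cc".
        // Diff becomes 2 at output offset 2 (after the rule consumes 4
        // input chars and emits 2).
        int[] out2 = {2}, diff2 = {2};
        System.out.println(correct(out2, diff2, 2));         // endOffset:   4 (good today)
        System.out.println(1 + correct(out2, diff2, 2 - 1)); // off-by-1 fix: 2 (bad, want 4)
    }
}
```

This reproduces the dilemma: the same "1 + correct(end - 1)" adjustment that repairs the paren example breaks the cccc example, because in the second case the token's end already sits exactly on a change point.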
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]