[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026591#comment-17026591 ]

Markus Jelsma commented on LUCENE-9112:
---------------------------------------

Hello Robert, 

I asked my colleague Jurian Broertjes (credit to him), who happens to have an 
excellent built-in debugger for foreign code, to find a fix. Instead of 
advancing the buffer for each sentence, the buffer is now advanced only when 
it needs to be.
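For anyone reading along without the patch, here is a minimal sketch of the 
idea; all names are illustrative and the real SegmentingTokenizerBase code 
differs. The unconsumed tail of the buffer is kept, and compacting/refilling 
happens only when the segmenter has run out of complete sentences, so a token 
straddling the old 1024-char buffer end is never cut in half:
{code}
// Minimal sketch of the idea, NOT the committed patch; names are
// hypothetical. The caller invokes advance() only when no complete
// sentence is left in the buffer, rather than once per sentence.
import java.io.IOException;
import java.io.Reader;

class LazyAdvanceBuffer {
  static final int BUFFER_MAX = 1024;          // same size Lucene uses
  final char[] buffer = new char[BUFFER_MAX];
  int length = 0;    // number of valid chars currently in the buffer
  int consumed = 0;  // chars already segmented into tokens

  /** Compact and refill; called only when more input is truly needed. */
  void advance(Reader input) throws IOException {
    int leftover = length - consumed;
    // Shift the unconsumed tail to the front instead of discarding it,
    // so a partial token at the old buffer end stays intact.
    System.arraycopy(buffer, consumed, buffer, 0, leftover);
    length = leftover;
    consumed = 0;
    int read = input.read(buffer, length, buffer.length - length);
    if (read > 0) {
      length += read;
    }
  }
}
{code}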

This fixes the unit test. I'll attach a new patch with the test and the fix. 
You will need the attached models en-sent.bin and en-token.bin as well.
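For reference, the test is shaped roughly like the sketch below. The padding, 
filler text, field name, and model directory are illustrative, and it assumes 
the OpenNLP tokenizer factory is registered under the SPI name "openNlp"; see 
the attached patch for the real test:
{code}
// Rough shape of the reproducing test, illustrative only.
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.junit.Test;

import static org.junit.Assert.assertNotEquals;

public class TestOpenNLPBufferBoundary {

  @Test
  public void testBaronIsNotSplit() throws Exception {
    Path modelDir = Paths.get("src/test-files"); // hypothetical model location
    Analyzer analyzer = CustomAnalyzer.builder(modelDir)
        .withTokenizer("openNlp",                // assumed SPI name
            "sentenceModel", "en-sent.bin",
            "tokenizerModel", "en-token.bin")
        .build();

    // Pad the input so that "Baron" straddles the 1024-char buffer
    // boundary of SegmentingTokenizerBase, with the spurious trailing
    // dots that trigger the bug present in an earlier sentence.
    StringBuilder text = new StringBuilder("A sentence with trailing dots... ");
    while (text.length() < 1020) {
      text.append("Some filler words here. ");
    }
    text.append("Baron appears at the boundary.");

    // Before the fix, "Baron" is emitted as "Baro" followed by "n".
    try (TokenStream ts = analyzer.tokenStream("body", text.toString())) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        assertNotEquals("Baro", term.toString());
      }
      ts.end();
    }
  }
}
{code}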

The custom models are not strictly necessary here; the problem persists with 
the existing models. If needed, I can adjust the patch once more to use only 
the existing models. The extracted tokens are just slightly different.

There are some issues, though: after patching master, we see these tests 
failing every time (but only when the full test suite is run):
{code}
   [junit4]   - org.apache.lucene.util.TestReproduceMessage.testAssumeRule
   [junit4]   - org.apache.lucene.util.TestReproduceMessage.testAssumeTest
   [junit4]   - org.apache.lucene.util.TestReproduceMessage.testAssumeBeforeClass
   [junit4]   - org.apache.lucene.util.TestReproduceMessage.testAssumeAfterClass
   [junit4]   - org.apache.lucene.util.TestReproduceMessage.testAssumeBefore
   [junit4]   - org.apache.lucene.util.TestReproduceMessage.testAssumeInitializer
   [junit4]   - org.apache.lucene.util.TestSysoutsLimits.testOverSoftLimit
   [junit4]   - org.apache.lucene.util.TestReproduceMessage.testAssumeAfter
   [junit4]   - org.apache.lucene.util.TestSysoutsLimits.testOverHardLimit
{code}

But ant -Dtestcase=TestReproduceMessage test passes just fine on its own. I 
ran the full test suite multiple times and these tests now fail consistently. 
Any ideas?

What do you think?

> SegmentingTokenizerBase splits terms that occupy 1024th positions in text
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch, LUCENE-9112-unittest.patch, 
> en-sent.bin, en-token.bin
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious 
> punctuation, such as triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' becomes the 
> token
> # much further down the text, a seemingly unrelated token is then suddenly 
> split up; in my example (see the attached unit test) the name 'Baron' is 
> split into 'Baro' and 'n'. This is the real problem.
> The problem never seems to occur with small texts in unit tests, but it 
> certainly does in real-world examples. Depending on the number of 'spurious' 
> dots, a completely different term can get split, or the same term at a 
> different location.
> I am not too sure whether this is actually a problem in the Lucene code, but 
> it is a problem, and I have a Lucene unit test proving it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
