[jira] [Commented] (LUCENE-9112) SegmentingTokenizerBase splits terms that occupy 1024th positions in text

Robert Muir (Jira) Thu, 30 Jan 2020 05:17:44 -0800


    [ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026675#comment-17026675 ]


Robert Muir commented on LUCENE-9112:
-------------------------------------

Hi [~markus17]. Is it possible to add unit tests to TestSegmentingTokenizerBase 
that demonstrate the behavior? I'm not familiar with how OpenNLP uses this 
thing.

The idea of the patch seems good, but some unit tests would really help make it 
solid: e.g. we don't want to introduce some crazy corner case, and all the 
concrete subclasses are complicated (OpenNLP, Chinese, Thai). The test in the 
patch is really more of an integration test: it only exercises OpenNLP, which 
is unrelated to your issue.

Sorry, I misunderstood your original problem. In your case OpenNLP is able to 
divide the text into multiple sentences (mathematically it must, or your test 
would not pass). But keep in mind this still won't change anything for other 
cases, such as a single "sentence" (according to OpenNLP) longer than 1024 
chars. For that, we'd need to allow the buffer size to be adjusted; maybe we 
should do that here too? It could be useful at least for writing the unit test, 
and also for simplifying the existing boundary tests in 
TestSegmentingTokenizerBase.
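
To make the two cases above concrete, here is a minimal sketch of buffered 
chunking with a configurable buffer size. It is not the Lucene implementation: 
the class BufferChunkSketch, the method chunk and the bufferSize parameter are 
hypothetical names, and sentence ends are approximated by '.', '!' and '?' 
rather than by a real sentence detector. It only illustrates that a chunk ends 
at the last sentence terminator inside the window when there is one, and is 
otherwise cut at the buffer limit, which can split a term.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not SegmentingTokenizerBase and not a Lucene API.
class BufferChunkSketch {

    // Splits text into chunks of at most bufferSize chars (bufferSize >= 1).
    // Each chunk ends just after the last sentence terminator found in the
    // current window, if the window contains one.
    static List<String> chunk(String text, int bufferSize) {
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + bufferSize, text.length());
            if (end < text.length()) {
                // More text follows: scan backwards for a safe break point.
                for (int i = end - 1; i >= start; i--) {
                    char c = text.charAt(i);
                    if (c == '.' || c == '!' || c == '?') {
                        end = i + 1; // keep the terminator in this chunk
                        break;
                    }
                }
                // No terminator in the window: end stays at the buffer limit,
                // so a word straddling that position is cut in two.
            }
            chunks.add(text.substring(start, end));
            start = end;
        }
        return chunks;
    }
}
```

With an adjustable size, a boundary test no longer needs more than 1024 chars 
of input: calling chunk("One. Two Baron", 8) yields "One.", " Two Bar" and 
"on", so "Baron" is split, while a bufferSize of 16 returns the text as a 
single chunk.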

> SegmentingTokenizerBase splits terms that occupy 1024th positions in text
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch, LUCENE-9112-unittest.patch, 
> LUCENE-9112.patch, LUCENE-9112.patch, en-sent.bin, en-token.bin
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious 
> punctuation such as triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' becomes the 
> token
> # much further down the text, a seemingly unrelated token is then suddenly 
> split up; in my example (see attached unit test) the name 'Baron' is split 
> into 'Baro' and 'n'. This is the real problem.
> The problems never seem to occur when using small texts in unit tests but 
> they certainly do in real-world examples. Depending on how many 'spurious' 
> dots there are, a completely different term can become split, or the same 
> term in just a different location.
> I am not too sure if this is actually a problem in the Lucene code, but it is 
> a problem and I have a Lucene unit test proving the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

