[jira] [Commented] (LUCENE-9112) SegmentingTokenizerBase splits terms that occupy 1024th positions in text

Robert Muir (Jira) Fri, 31 Jan 2020 12:47:43 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027820#comment-17027820
 ]


Robert Muir commented on LUCENE-9112:
-------------------------------------

If you look at the current unit test, it defines a couple of test subclasses 
that are very simple. For example, it has {{WholeSentenceTokenizer}}, which 
simply returns the sentences directly from SegmentingTokenizerBase as whole 
tokens.

Could we make a simple test based on this that fails without the patch and 
passes with it?

If the buffer size is configurable, it could be set very small in the test 
(e.g. 5), which would let us test all possibilities very easily.
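
A minimal sketch of that test idea follows. It is not Lucene code and not the 
attached patch: {{SmallBufferSketch}}, {{tokenize}} and the 
{{backUpToWhitespace}} flag are hypothetical names, and plain whitespace 
splitting stands in for sentence segmentation. It only illustrates how a tiny 
buffer lets a test sweep every boundary position, assuming the buffer can hold 
the longest word plus one separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a buffered tokenizer; NOT Lucene code and NOT the attached patch.
class SmallBufferSketch {

    // Splits text into whitespace-delimited words while consuming it in chunks
    // of at most bufferSize chars, the way a fixed-size read buffer would.
    static List<String> tokenize(String text, int bufferSize, boolean backUpToWhitespace) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + bufferSize, text.length());
            if (backUpToWhitespace && end < text.length()) {
                // More input follows: end the chunk just after its last whitespace
                // so that a word is never cut in two.
                int safe = lastWhitespace(text, start, end) + 1;
                if (safe > start) {
                    end = safe;
                }
            }
            for (String word : text.substring(start, end).trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    tokens.add(word);
                }
            }
            start = end;
        }
        return tokens;
    }

    // Index of the last whitespace char in [from, to), or -1 if there is none.
    static int lastWhitespace(String text, int from, int to) {
        for (int i = to - 1; i >= from; i--) {
            if (Character.isWhitespace(text.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "The name Baron must stay whole";
        List<String> expected = Arrays.asList(text.split("\\s+"));
        // Sweep every buffer size that can hold the longest word plus one separator.
        for (int size = 6; size <= text.length() + 1; size++) {
            boolean naiveOk = tokenize(text, size, false).equals(expected);
            boolean backUpOk = tokenize(text, size, true).equals(expected);
            System.out.println("bufferSize=" + size + " naive=" + naiveOk + " backUp=" + backUpOk);
        }
    }
}
```

With the flag off, the chunk boundary can land inside a word and cut it in 
two, which is the same kind of failure as 'Baron' becoming 'Baro' and 'n'. 
With the flag on, the sweep should match the plain split at every size tried.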

> SegmentingTokenizerBase splits terms that occupy 1024th positions in text
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch, LUCENE-9112-unittest.patch, 
> LUCENE-9112.patch, LUCENE-9112.patch, en-sent.bin, en-token.bin
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious 
> punctuation, such as triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' becomes the 
> token
> # much further down the text, a seemingly unrelated token is then suddenly 
> split up; in my example (see attached unit test) the name 'Baron' is split 
> into 'Baro' and 'n'. This is the real problem.
> The problems never seem to occur when using small texts in unit tests, but 
> they certainly do in real-world examples. Depending on how many 'spurious' 
> dots there are, a completely different term can become split, or the same 
> term in just a different location.
> I am not too sure if this is actually a problem in the Lucene code, but it is 
> a problem and I have a Lucene unit test proving the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

