[
https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027702#comment-17027702
]
Markus Jelsma commented on LUCENE-9112:
---------------------------------------
Hello Robert,
I agree, an adjustable buffer size is useful. It would indeed help with some
edge cases I have already come across, because the problem remains whenever a
sentence is longer than 1024 characters. I have not seen many, but of the 50k+
sentences read from Dutch Wikipedia so far, a handful were certainly longer
than 1024.
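
To check how often a corpus exceeds a given buffer size, something along these
lines could be used. It is only a rough sketch: it relies on the JDK's
java.text.BreakIterator for sentence detection rather than the OpenNLP sentence
model, so the counts will differ from what the tokenizer actually sees, and the
class and method names are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or OpenNLP.
final class LongSentenceFinder {

    /** Returns the length (in chars) of every sentence longer than maxChars. */
    static List<Integer> sentencesLongerThan(String text, int maxChars, Locale locale) {
        // Assumption: JDK sentence rules stand in for the real sentence detector.
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        sentences.setText(text);

        List<Integer> tooLong = new ArrayList<>();
        int start = sentences.first();
        int end = sentences.next();
        while (end != BreakIterator.DONE) {
            int length = end - start;
            if (length > maxChars) {
                tooLong.add(length);
            }
            start = end;
            end = sentences.next();
        }
        return tooLong;
    }

    public static void main(String[] args) {
        // Build one artificially long sentence between two short ones.
        StringBuilder sb = new StringBuilder("Short one. ");
        for (int i = 0; i < 300; i++) {
            sb.append("Word ");
        }
        sb.append("end. Another short one.");

        List<Integer> hits = sentencesLongerThan(sb.toString(), 1024, Locale.ENGLISH);
        System.out.println("Sentences longer than 1024 chars: " + hits);
    }
}
```

Running this over the sentences of a real corpus with maxChars set to the
candidate buffer size would show how many sentences would still not fit.
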
I'm fine with additional tests, but what exactly would you like to see tested?
The test I made is indeed more of an integration test, but it does demonstrate
the problem.
What do you think?
> SegmentingTokenizerBase splits terms that occupy 1024th positions in text
> -------------------------------------------------------------------------
>
> Key: LUCENE-9112
> URL: https://issues.apache.org/jira/browse/LUCENE-9112
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: master (9.0)
> Reporter: Markus Jelsma
> Priority: Major
> Labels: opennlp
> Fix For: master (9.0)
>
> Attachments: LUCENE-9112-unittest.patch, LUCENE-9112-unittest.patch,
> LUCENE-9112.patch, LUCENE-9112.patch, en-sent.bin, en-token.bin
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious
> punctuation, such as triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' becomes the
> token
> # much further down the text, a seemingly unrelated token is then suddenly
> split up; in my example (see attached unit test) the name 'Baron' is split
> into 'Baro' and 'n'. This is the real problem.
> The problems never seem to occur when using small texts in unit tests, but
> they certainly do in real-world examples. Depending on the number of
> 'spurious' dots, a completely different term can be split, or the same term
> in a different location.
> I am not too sure whether this is actually a problem in the Lucene code, but
> it is a problem, and I have a Lucene unit test proving it.
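
The kind of split described above can be reproduced in isolation with a
deliberately naive sketch. This is not the SegmentingTokenizerBase code and
makes no claim about where exactly the bug lives in Lucene; it only assumes
that text is consumed in fixed-size chunks and that each chunk is tokenized on
its own. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Lucene.
final class BufferBoundarySplitDemo {

    /**
     * Splits text on whitespace, but processes it in independent chunks of
     * bufferSize chars, so a token straddling a chunk boundary is cut in two.
     */
    static List<String> tokenizeInChunks(String text, int bufferSize) {
        List<String> tokens = new ArrayList<>();
        for (int offset = 0; offset < text.length(); offset += bufferSize) {
            int end = Math.min(text.length(), offset + bufferSize);
            String chunk = text.substring(offset, end);
            for (String token : chunk.split("\\s+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Pad the text so that "Baron" starts at offset 1020 and its last
        // letter sits at offset 1024, the first position of the second chunk.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1019; i++) {
            sb.append('x');
        }
        sb.append(" Baron von Example");
        String text = sb.toString();

        List<String> small = tokenizeInChunks(text, 1024);
        List<String> large = tokenizeInChunks(text, 2048);
        // Skip the long padding token when printing.
        System.out.println("buffer 1024: " + small.subList(1, small.size()));
        System.out.println("buffer 2048: " + large.subList(1, large.size()));
    }
}
```

A larger buffer makes the split disappear for this input, which also explains
why small texts in unit tests never show the problem; any input longer than
the buffer can still hit it.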