[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

Christophe Bismuth (JIRA) Thu, 22 Nov 2018 08:00:43 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16696052#comment-16696052
 ]


Christophe Bismuth commented on LUCENE-8548:
--------------------------------------------

Great! Thank you [~rcmuir], I'll dig into this (y)

> Reevaluate scripts boundary break in Nori's tokenizer
> -----------------------------------------------------
>
>                 Key: LUCENE-8548
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8548
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: testCyrillicWord.dot.png
>
>
> This was first reported in https://issues.apache.org/jira/browse/LUCENE-8526:
> {noformat}
> Tokens are split on different character POS types (which seem to not quite 
> line up with Unicode character blocks), which leads to weird results for 
> non-CJK tokens:
> εἰμί is tokenized as three tokens: ε/SL(Foreign language) + ἰ/SY(Other 
> symbol) + μί/SL(Foreign language)
> ka̠k̚t͡ɕ͈a̠k̚ is tokenized as ka/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol) + t/SL(Foreign language) + 
> ͡ɕ͈/SY(Other symbol) + a/SL(Foreign language) + ̠/SY(Other symbol) + 
> k/SL(Foreign language) + ̚/SY(Other symbol)
> Ба̀лтичко̄ is tokenized as ба/SL(Foreign language) + ̀/SY(Other symbol) + 
> лтичко/SL(Foreign language) + ̄/SY(Other symbol)
> don't is tokenized as don + t; same for don't (with a curly apostrophe).
> אוֹג׳וּ is tokenized as אוֹג/SY(Other symbol) + וּ/SY(Other symbol)
> Мoscow (with a Cyrillic М and the rest in Latin) is tokenized as м + oscow
> While it is still possible to find these words using Nori, there are many 
> more chances for false positives when the tokens are split up like this. In 
> particular, individual numbers and combining diacritics are indexed 
> separately (e.g., in the Cyrillic example above), which can lead to a 
> performance hit on large corpora like Wiktionary or Wikipedia.
> Work around: use a character filter to get rid of combining diacritics before 
> Nori processes the text. This doesn't solve the Greek, Hebrew, or English 
> cases, though.
> Suggested fix: Characters in related Unicode blocks—like "Greek" and "Greek 
> Extended", or "Latin" and "IPA Extensions"—should not trigger token splits. 
> Combining diacritics should not trigger token splits. Non-CJK text should be 
> tokenized on spaces and punctuation, not by character type shifts. 
> Apostrophe-like characters should not trigger token splits (though I could 
> see someone disagreeing on this one).{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

Reply via email to