[
https://issues.apache.org/jira/browse/LUCENE-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393642#comment-15393642
]
AM commented on LUCENE-7393:
----------------------------
Agreed, it is better that ICU handles it. To clarify, did you mean that the 1%
is for rule-based syllable segmentation? A dictionary-based approach to word
segmentation would definitely have an error rate above 1%. I noticed that the
ICU algorithm does not segment person names. As a user, it would be ideal if
the ICU algorithm could identify basic syllables plus named entities (persons,
locations, and organizations). However, a dictionary is static, new words keep
appearing, and segmentation is context-sensitive, so I'm not sure how to handle
that. The rule-based syllable algorithm in Lucene is nearly perfect and I'm
satisfied with it. I'm also curious: where did you get the rules?
I didn't see the patch link, though.
Thanks a lot.
> Incorrect ICUTokenization on South East Asian Language
> ------------------------------------------------------
>
> Key: LUCENE-7393
> URL: https://issues.apache.org/jira/browse/LUCENE-7393
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 5.5
> Environment: Ubuntu
> Reporter: AM
> Attachments: LUCENE-7393.patch
>
>
> Lucene 4.10.3 correctly tokenizes a syllable into one token. However, in
> Lucene 5.5.0 it ends up as two tokens, which is incorrect. Could you please
> let me know whether the segmentation rules are implemented by native speakers
> of a particular language? In this particular example, the language is Myanmar.
> My understanding is that Lao, Khmer, and Myanmar fall into the ICU category.
> Thanks a lot.
> h4. Example 4.10.3
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>   "tokens": [
>     {
>       "token": "နည်",
>       "start_offset": 1,
>       "end_offset": 4,
>       "type": "<ALPHANUM>",
>       "position": 1
>     }
>   ]
> }
> {code}
> h4. Example 5.5.0
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>   "tokens": [
>     {
>       "token": "န",
>       "start_offset": 0,
>       "end_offset": 1,
>       "type": "<ALPHANUM>",
>       "position": 0
>     },
>     {
>       "token": "ည်",
>       "start_offset": 1,
>       "end_offset": 3,
>       "type": "<ALPHANUM>",
>       "position": 1
>     }
>   ]
> }
> {code}
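
For reference, the example string in the issue consists of three code points
(a consonant, a second consonant, and the asat sign that "kills" the second
consonant), which matches the end_offset of 3 in the 5.5.0 output. The sketch
below, in Python with only the standard library, lists those code points and
applies a deliberately simplified syllable-break rule: break before a consonant
unless it is followed by asat or stacked under a virama. This rule is an
assumption made for illustration and is not the rule set used by ICU or Lucene;
the helper names (describe, syllable_breaks, split_syllables) are hypothetical.

```python
import unicodedata

ASAT = "\u103A"    # MYANMAR SIGN ASAT: kills the inherent vowel of the preceding consonant
VIRAMA = "\u1039"  # MYANMAR SIGN VIRAMA: stacks the following consonant


def is_consonant(ch):
    # Myanmar consonant letters KA..A occupy U+1000..U+1021.
    return "\u1000" <= ch <= "\u1021"


def describe(text):
    # Return (code point, Unicode name) pairs for each character.
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def syllable_breaks(text):
    # Simplified assumption: a consonant starts a new syllable unless it is
    # a syllable-final consonant (followed by asat) or a stacked consonant
    # (preceded by virama). Real rule sets handle many more cases.
    breaks = []
    for i in range(1, len(text)):
        if not is_consonant(text[i]):
            continue
        followed_by_asat = i + 1 < len(text) and text[i + 1] == ASAT
        stacked = text[i - 1] == VIRAMA
        if not followed_by_asat and not stacked:
            breaks.append(i)
    return breaks


def split_syllables(text):
    # Cut the text at every break position found above.
    pieces, start = [], 0
    for b in syllable_breaks(text):
        pieces.append(text[start:b])
        start = b
    pieces.append(text[start:])
    return pieces


if __name__ == "__main__":
    sample = "\u1014\u100A\u103A"  # the string from the issue
    for code_point, name in describe(sample):
        print(code_point, name)
    print(split_syllables(sample))
```

Under this simplified rule the sample stays a single syllable, which is the
behaviour reported for 4.10.3; splitting after the first consonant, as in the
5.5.0 output, separates the final consonant and its asat from the syllable
they close.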