[jira] [Commented] (LUCENE-7393) Incorrect ICUTokenization on South East Asian Language

AM (JIRA) Tue, 26 Jul 2016 04:11:35 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393642#comment-15393642 ]


AM commented on LUCENE-7393:
----------------------------

Agreed, it is better to let ICU handle it.  To clarify, did you mean that the 
1% is for rule-based syllable segmentation?  The error rate of a 
dictionary-based approach to word segmentation would definitely be more than 
1%.  I noticed that the ICU algorithm does not segment person names.  As a 
user, it would be ideal if the ICU algorithm could identify basic syllables 
plus [Person, Location and Organization] names.  But a dictionary is static, 
new words keep appearing, and segmentation is context sensitive, so I'm not 
sure how to handle that.  The rule-based syllable algorithm in Lucene is close 
to perfect and I'm satisfied with it.  I'm also curious: where did you get the 
rules?

I didn't see the patch link though.  

Thanks a lot.  

> Incorrect ICUTokenization on South East Asian Language
> ------------------------------------------------------
>
>                 Key: LUCENE-7393
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7393
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.5
>         Environment: Ubuntu
>            Reporter: AM
>         Attachments: LUCENE-7393.patch
>
>
> Lucene 4.10.3 correctly tokenizes a syllable into one token.  However, in 
> Lucene 5.5.0 it ends up as two tokens, which is incorrect.  Could you please 
> let me know whether the segmentation rules are implemented by native speakers 
> of a particular language? In this particular example, it is the Myanmar 
> language.  I understand that Lao, Khmer and Myanmar fall into the ICU 
> category.  
> Thanks a lot.
> h4. Example 4.10.3
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>    "tokens": [
>       {
>          "token": "နည်",
>          "start_offset": 1,
>          "end_offset": 4,
>          "type": "<ALPHANUM>",
>          "position": 1
>       }
>    ]
> }
> {code}
> h4. Example 5.5.0
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>   "tokens": [
>     {
>       "token": "န",
>       "start_offset": 0,
>       "end_offset": 1,
>       "type": "<ALPHANUM>",
>       "position": 0
>     },
>     {
>       "token": "ည်",
>       "start_offset": 1,
>       "end_offset": 3,
>       "type": "<ALPHANUM>",
>       "position": 1
>     }
>   ]
> }
> {code}
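
To make the expected behaviour in the report concrete, here is a minimal Python sketch of a rule-based syllable grouping for the example text. It is not the rule set used by ICU or Lucene; the function name `toy_syllables` and the single rule it applies (a consonant followed by the asat sign U+103A closes the previous syllable instead of starting a new one) are assumptions made purely for illustration.

```python
import unicodedata

ASAT = "\u103A"  # MYANMAR SIGN ASAT, marks a consonant as syllable-final


def describe(text):
    """List each code point with its Unicode name, for inspection."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def toy_syllables(text):
    """Group Myanmar code points into syllables with one illustrative rule.

    Assumption (not the ICU/Lucene rules): a base consonant in the range
    U+1000..U+1021 starts a new syllable unless the very next code point
    is ASAT, in which case it stays attached to the syllable before it.
    Every other code point is attached to the current syllable.
    """
    syllables = []
    for i, ch in enumerate(text):
        is_consonant = "\u1000" <= ch <= "\u1021"
        followed_by_asat = text[i + 1 : i + 2] == ASAT
        starts_syllable = is_consonant and not followed_by_asat
        if starts_syllable or not syllables:
            syllables.append(ch)
        else:
            syllables[-1] += ch
    return syllables


if __name__ == "__main__":
    text = "နည်"  # the example from the issue description
    for code_point, name in describe(text):
        print(code_point, name)
    print(toy_syllables(text))
```

Under this toy rule the example stays in one piece, which matches the 4.10.3 output above, whereas the 5.5.0 output splits after the first code point (offsets 0-1 and 1-3). A real segmenter needs many more rules (stacked consonants, medials, digits, and so on), which is why the sketch should not be read as a replacement for the ICU rules.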



