[jira] [Commented] (LUCENE-7393) Incorrect ICUTokenization on South East Asian Language

AM (JIRA) Sun, 24 Jul 2016 17:36:03 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391211#comment-15391211
 ]


AM commented on LUCENE-7393:
----------------------------

Yes, a syllable vs. word option would be perfect.  With the dictionary-based 
approach, some segmentations will not always be correct, since the right word 
boundary depends on the context. For example, 'ရန်ကုန်' means Yangon city and 
'ကုန်သည်' means trader.  But when the two overlap in a phrase like 
တက်လာရန်ကုန်သည်များက, it should be segmented as တက်|လာ|ရန်|ကုန်သည်|များ|က 
instead of တက်|လာ|ရန်ကုန်|သည်|များ|က.  As you can see, the syllable ကုန် is the 
overlap.  Both words could be in the dictionary, so selecting the correct one 
requires context knowledge, which is very hard to capture with hand-crafted 
algorithms.  Still, the dictionary approach is good to have until we have 
better language understanding.  
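
To make the overlap concrete, here is a minimal sketch in Python. It is only an 
illustration: the eight-word dictionary is a hypothetical toy, not the ICU 
dictionary, and the functions are not how ICU or Lucene segment text. It shows 
that the dictionary alone admits both segmentations, and that a plain 
longest-match strategy picks the wrong one.

```python
# Toy illustration only: DICTIONARY is hypothetical, not the ICU dictionary.
TAK, LA, YAN, KON, THI, MYA, KA = "တက်", "လာ", "ရန်", "ကုန်", "သည်", "များ", "က"

# Both 'Yangon' (YAN + KON) and 'trader' (KON + THI) are valid entries.
DICTIONARY = {TAK, LA, YAN, YAN + KON, KON + THI, THI, MYA, KA}
PHRASE = TAK + LA + YAN + KON + THI + MYA + KA


def segmentations(text, dictionary):
    """Return every way to split text into dictionary words."""
    if not text:
        return [[]]
    results = []
    for word in sorted(dictionary, key=len):
        if text.startswith(word):
            for rest in segmentations(text[len(word):], dictionary):
                results.append([word] + rest)
    return results


def longest_match(text, dictionary):
    """Greedy segmentation: always take the longest word that fits."""
    tokens = []
    while text:
        candidates = [w for w in dictionary if text.startswith(w)]
        if not candidates:
            raise ValueError("no dictionary word matches the remaining text")
        word = max(candidates, key=len)
        tokens.append(word)
        text = text[len(word):]
    return tokens


all_segs = segmentations(PHRASE, DICTIONARY)
for seg in all_segs:
    print("|".join(seg))
print("greedy:", "|".join(longest_match(PHRASE, DICTIONARY)))
```

With this toy dictionary there are exactly two valid segmentations, and the 
greedy one is the ရန်ကုန်|သည် split, so nothing in the dictionary itself can 
tell the two apart.  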

Would it be possible to add other words not in the ICU dictionary during 
analysis? 

Thanks a lot. 

> Incorrect ICUTokenization on South East Asian Language
> ------------------------------------------------------
>
>                 Key: LUCENE-7393
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7393
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.5
>         Environment: Ubuntu
>            Reporter: AM
>
> Lucene 4.10.3 correctly tokenizes a syllable into one token.  However, in 
> Lucene 5.5.0 it ends up as two tokens, which is incorrect.  Could you let me 
> know whether the segmentation rules are implemented by native speakers of the 
> particular language? In this particular example, it is the Myanmar language.  
> I understand that Lao, Khmer and Myanmar fall into the ICU category.  
> Thanks a lot.
> h4. Example 4.10.3
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>    "tokens": [
>       {
>          "token": "နည်",
>          "start_offset": 1,
>          "end_offset": 4,
>          "type": "<ALPHANUM>",
>          "position": 1
>       }
>    ]
> }
> {code}
> h4. Example 5.5.0
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>   "tokens": [
>     {
>       "token": "န",
>       "start_offset": 0,
>       "end_offset": 1,
>       "type": "<ALPHANUM>",
>       "position": 0
>     },
>     {
>       "token": "ည်",
>       "start_offset": 1,
>       "end_offset": 3,
>       "type": "<ALPHANUM>",
>       "position": 1
>     }
>   ]
> }
> {code}
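
For reference, the offsets in the 5.5.0 output can be checked against the code 
points of the syllable. The sketch below assumes the syllable is the sequence 
U+1014 U+100A U+103A, which matches the three-unit span reported above. It only 
inspects the string and does not call Lucene or ICU.

```python
import unicodedata

# Assumed code points of the syllable from the examples above.
syllable = "\u1014\u100a\u103a"

# List each code point with its offset and Unicode name.
for offset, ch in enumerate(syllable):
    print(offset, f"U+{ord(ch):04X}", unicodedata.name(ch))

# The 5.5.0 output splits at offset 1: [0, 1) and [1, 3).
tokens_550 = [syllable[0:1], syllable[1:3]]
print(tokens_550)
```

The split falls between the first consonant and the rest of the syllable. The 
4.10.3 offsets (1 to 4) cover the same three units shifted by one, presumably 
because the quoted text parameter includes the leading quote character.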



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
