[jira] [Commented] (LUCENE-7393) Incorrect ICUTokenization on South East Asian Language

Robert Muir (JIRA) Sun, 24 Jul 2016 06:02:07 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391056#comment-15391056 ]


Robert Muir commented on LUCENE-7393:
-------------------------------------

OK, that is interesting to hear. I agree that fixing the hand-coded stuff looks 
tricky. From my perspective, the ideal solution would first use rules to find 
syllable breaks: this would restrict where breaks can happen at all, and then 
the dictionary would just refine that further.
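
A minimal sketch of that two-stage idea, for illustration only. The class 
SyllableFirstSegmenter, its syllable rule, and the one-word dictionary are all 
hypothetical: the rule is a deliberately simplified one (break before a 
consonant unless it carries asat/virama or follows a virama), not the old 
Lucene rules and not what ICU does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy two-stage segmenter: rules find syllables, a dictionary merges them.
final class SyllableFirstSegmenter {
    private static final char ASAT = '\u103A';   // MYANMAR SIGN ASAT
    private static final char VIRAMA = '\u1039'; // MYANMAR SIGN VIRAMA

    private static boolean isConsonant(char c) {
        return c >= '\u1000' && c <= '\u1021';   // KA .. A
    }

    // Stage 1 (simplified assumption): a syllable may only start at a
    // consonant that is neither a final/stacked consonant (followed by
    // asat or virama) nor a subscript (preceded by virama).
    static List<String> syllables(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < text.length(); i++) {
            if (!isConsonant(text.charAt(i))) continue;
            boolean finalOrStacked = i + 1 < text.length()
                && (text.charAt(i + 1) == ASAT || text.charAt(i + 1) == VIRAMA);
            boolean subscript = text.charAt(i - 1) == VIRAMA;
            if (!finalOrStacked && !subscript) {
                out.add(text.substring(start, i));
                start = i;
            }
        }
        if (start < text.length()) out.add(text.substring(start));
        return out;
    }

    // Stage 2: greedy longest match over whole syllables, so the dictionary
    // can merge syllables into words but can never split inside a syllable.
    static List<String> words(List<String> syllables, Set<String> dictionary) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < syllables.size()) {
            int bestEnd = i + 1; // fallback: emit the single syllable
            StringBuilder candidate = new StringBuilder();
            for (int j = i; j < syllables.size(); j++) {
                candidate.append(syllables.get(j));
                if (dictionary.contains(candidate.toString())) bestEnd = j + 1;
            }
            out.add(String.join("", syllables.subList(i, bestEnd)));
            i = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        String reported = "\u1014\u100A\u103A";            // the syllable from this issue
        String myanmar = "\u1019\u103C\u1014\u103A\u1019\u102C"; // two syllables, one word
        Set<String> toyDictionary = Set.of(myanmar);       // hypothetical dictionary
        System.out.println(words(syllables(reported), toyDictionary));
        System.out.println(syllables(myanmar));
        System.out.println(words(syllables(myanmar), toyDictionary));
    }
}
```

With this toy rule the syllable from the report stays a single unit because 
the second consonant carries an asat, and the dictionary stage can only join 
syllables, never cut one in half.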

Here is the link for the icu4j dictionary:
http://source.icu-project.org/repos/icu/icu/trunk/source/data/brkitr/dictionaries/burmesedict.txt

Perhaps we should restore the old syllable rules, and make "syllable" vs "word" 
available as an option for Myanmar? 

I replaced these syllable rules with the ICU dictionary functionality, for two 
reasons:
1. Rules were of varying quality depending on language. Lao syllable splitting 
came from a paper (see 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.0.0/lucene/analysis/icu/src/data/uax29/Lao.rbbi)
 which claims > 98% accuracy. This is quite sophisticated and even has 
backtracking logic. On the other hand, I think the Myanmar rules were just 
something I came up with (unknown quality)...
2. It is unclear whether the syllable is a good indexing unit for search. In my 
mind, syllable-as-token does make sense when the language is mostly 
monosyllabic. At the same time, we don't have any kind of advanced IR test 
suites for these languages to really know for sure.


> Incorrect ICUTokenization on South East Asian Language
> ------------------------------------------------------
>
>                 Key: LUCENE-7393
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7393
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.5
>         Environment: Ubuntu
>            Reporter: AM
>
> Lucene 4.10.3 correctly tokenizes a syllable into one token.  However, in 
> Lucene 5.5.0 it ends up being two tokens, which is incorrect.  Please let me 
> know whether the segmentation rules are implemented by native speakers of a 
> particular language. In this particular example, it is the Myanmar language.  
> I understand that Lao, Khmer, and Myanmar fall into the ICU category.  
> Thanks a lot.
> h4. Example 4.10.3
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>    "tokens": [
>       {
>          "token": "နည်",
>          "start_offset": 1,
>          "end_offset": 4,
>          "type": "<ALPHANUM>",
>          "position": 1
>       }
>    ]
> }
> {code}
> h4. Example 5.5.0
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>   "tokens": [
>     {
>       "token": "န",
>       "start_offset": 0,
>       "end_offset": 1,
>       "type": "<ALPHANUM>",
>       "position": 0
>     },
>     {
>       "token": "ည်",
>       "start_offset": 1,
>       "end_offset": 3,
>       "type": "<ALPHANUM>",
>       "position": 1
>     }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

