[
https://issues.apache.org/jira/browse/LUCENE-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391056#comment-15391056
]
Robert Muir commented on LUCENE-7393:
-------------------------------------
OK, that is interesting to hear. I agree that fixing the hand-coded stuff looks
tricky. From my perspective, the ideal solution would first use rules to find
syllable breaks: this would restrict where breaks can happen at all, and then
the dictionary would just refine that further.
Here is the link for the icu4j dictionary:
http://source.icu-project.org/repos/icu/icu/trunk/source/data/brkitr/dictionaries/burmesedict.txt
Perhaps we should restore the old syllable rules, and make "syllable" vs "word"
available as an option for Myanmar?
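
To make the "rules first, dictionary second" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the break rule is a deliberately simplified one (break before a Myanmar consonant unless it is followed by asat or virama, or preceded by virama), not the RBBI rules Lucene shipped, and the function names and the tiny dictionary are hypothetical. The dictionary stage only merges whole syllables, so it can never introduce a break inside a syllable.

```python
# Simplified, assumed rule set; NOT the rules from Lucene's Myanmar.rbbi.
CONSONANTS = {chr(cp) for cp in range(0x1000, 0x1022)}  # U+1000..U+1021
ASAT = "\u103A"    # vowel killer: marks the preceding consonant as a final
VIRAMA = "\u1039"  # stacking sign: the following consonant is subjoined


def split_syllables(text):
    """Rule stage: return the syllables of `text` under the simplified rule."""
    if not text:
        return []
    starts = [0]
    for i in range(1, len(text)):
        if text[i] not in CONSONANTS:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt == ASAT or nxt == VIRAMA:
            continue  # consonant closes or stacks onto the current syllable
        if text[i - 1] == VIRAMA:
            continue  # subjoined consonant, still the same syllable
        starts.append(i)
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def merge_with_dictionary(syllables, dictionary, max_len=4):
    """Dictionary stage: greedy longest match over whole syllables only."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest run first; a single syllable is always accepted.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


if __name__ == "__main__":
    text = "\u1019\u103C\u1014\u103A\u1019\u102C"  # two syllables
    syllables = split_syllables(text)
    print(syllables)                                 # "syllable" mode
    print(merge_with_dictionary(syllables, {text}))  # "word" mode
```

Under this sketch, the "syllable" vs "word" option is just whether the dictionary stage runs: passing an empty dictionary leaves the syllables as tokens. The string from the report (U+1014 U+100A U+103A) stays a single syllable in the rule stage because the second consonant is followed by asat.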
I replaced these syllable rules with the ICU dictionary functionality, for two
reasons:
1. Rules were of varying quality depending on language. Lao syllable splitting
came from a paper (see
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.0.0/lucene/analysis/icu/src/data/uax29/Lao.rbbi)
which claims > 98% accuracy. This is quite sophisticated and even has
backtracking logic. On the other hand, I think the Myanmar rules were just
something I came up with (unknown quality)...
2. Unclear if syllable is a good indexing unit for search. In my mind,
syllable-as-token does make sense when the language is mostly monosyllabic. At
the same time, we don't have any kind of advanced IR test suites for these
languages to really know for sure.
> Incorrect ICUTokenization on South East Asian Language
> ------------------------------------------------------
>
> Key: LUCENE-7393
> URL: https://issues.apache.org/jira/browse/LUCENE-7393
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 5.5
> Environment: Ubuntu
> Reporter: AM
>
> Lucene 4.10.3 correctly tokenizes a syllable into one token. However, in
> Lucene 5.5.0 it ends up being two tokens, which is incorrect. Please let me
> know whether the segmentation rules are implemented by native speakers of a
> particular language. In this particular example, it is the Myanmar language.
> I have understood that Lao, Khmer and Myanmar fall into the ICU category.
> Thanks a lot.
> h4. Example 4.10.3
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>   "tokens": [
>     {
>       "token": "နည်",
>       "start_offset": 1,
>       "end_offset": 4,
>       "type": "<ALPHANUM>",
>       "position": 1
>     }
>   ]
> }
> {code}
> h4. Example 5.5.0
> {code:javascript}
> GET _analyze?tokenizer=icu_tokenizer&text="နည်"
> {
>   "tokens": [
>     {
>       "token": "န",
>       "start_offset": 0,
>       "end_offset": 1,
>       "type": "<ALPHANUM>",
>       "position": 0
>     },
>     {
>       "token": "ည်",
>       "start_offset": 1,
>       "end_offset": 3,
>       "type": "<ALPHANUM>",
>       "position": 1
>     }
>   ]
> }
> {code}