Namgyu Kim created LUCENE-8553:
----------------------------------
Summary: New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
Key: LUCENE-8553
URL: https://issues.apache.org/jira/browse/LUCENE-8553
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Reporter: Namgyu Kim
This patch adds KoreanDecomposeFilter.
The filter decomposes Hangul syllables into their component letters
(e.g. 한글 -> ㅎㄱ or ㅎㅏㄴㄱㅡㄹ).
Hangul is typed differently from English.
If you want to type "apple" in English,
you press the keys in the order {color:#FF0000}a -> p -> p -> l -> e{color}.
However, if you want to type "한글" (Hangul) in Hangul,
you have to press the keys in the order {color:#FF0000}ㅎ -> ㅏ -> ㄴ -> ㄱ -> ㅡ -> ㄹ{color}
(because of the keyboard layout).
This means that spell checking on whole Hangul syllables can be less accurate,
since one mistyped key changes an entire syllable.
A Hangul syllable is built from up to three elements: *"Choseong"* (initial consonant),
*"Jungseong"* (vowel), and *"Jongseong"* (final consonant).
These three elements are called *"Jamo"*.
Take the Korean word "된장찌개" (soybean paste stew) as an example:
its *"Choseong"* are {color:#FF0000}"ㄷ, ㅈ, ㅉ, ㄱ"{color},
its *"Jungseong"* are {color:#FF0000}"ㅚ, ㅏ, ㅣ, ㅐ"{color},
and its *"Jongseong"* are {color:#FF0000}"ㄴ, ㅇ"{color}.
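The split above can be reproduced with plain Unicode arithmetic, because precomposed Hangul syllables (U+AC00 to U+D7A3) are laid out as 0xAC00 + (choseong * 21 + jungseong) * 28 + jongseong. The following is a minimal sketch of that idea, not the code in the patch; the class name `JamoPartsSketch` and the method `parts` are hypothetical, and the output uses Hangul Compatibility Jamo as in the examples in this issue.

```java
final class JamoPartsSketch {
    // Compatibility Jamo listed in Unicode syllable-composition order.
    static final String CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";             // 19 initials
    static final String JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";        // 21 vowels
    static final String JONG = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"; // index 0 = no final

    /** Returns {choseong, jungseong, jongseong} collected over every syllable in text. */
    static String[] parts(String text) {
        StringBuilder cho = new StringBuilder();
        StringBuilder jung = new StringBuilder();
        StringBuilder jong = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0xAC00 || c > 0xD7A3) {
                continue; // not a precomposed Hangul syllable, skipped in this sketch
            }
            int idx = c - 0xAC00;
            cho.append(CHO.charAt(idx / (21 * 28)));
            jung.append(JUNG.charAt((idx / 28) % 21));
            if (idx % 28 != 0) {
                jong.append(JONG.charAt(idx % 28)); // only when a final consonant exists
            }
        }
        return new String[] { cho.toString(), jung.toString(), jong.toString() };
    }

    public static void main(String[] args) {
        // Prints: ㄷㅈㅉㄱ / ㅚㅏㅣㅐ / ㄴㅇ
        System.out.println(String.join(" / ", parts("된장찌개")));
    }
}
```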
Jamo decomposition is needed for spell checking, as explained above.
A "Choseong Filter" is needed as well, because many Koreans use
*"Choseong Search"* (especially on mobile devices).
Searching for "된장찌개" in full requires typing 10 Jamo, which is quite a lot.
For that reason, I think it would be useful to provide a filter that lets users
search with just "ㄷㅈㅉㄱ".
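To illustrate what Choseong search means in practice, here is a small sketch that reduces each word to its initial consonants and compares that key with the query. It is only an illustration under my own assumptions: `ChoseongSearchSketch`, `choseongKey`, and `search` are hypothetical names, the word list is made up, and a real Lucene setup would index the filter output rather than scan a list.

```java
import java.util.ArrayList;
import java.util.List;

final class ChoseongSearchSketch {
    // The 19 initial consonants in Unicode syllable-composition order.
    static final String CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";

    /** Replaces every precomposed Hangul syllable with its Choseong. */
    static String choseongKey(String word) {
        StringBuilder key = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (c >= 0xAC00 && c <= 0xD7A3) {
                key.append(CHO.charAt((c - 0xAC00) / (21 * 28)));
            } else {
                key.append(c); // non-Hangul characters are kept unchanged
            }
        }
        return key.toString();
    }

    /** Returns the words whose Choseong key equals the query. */
    static List<String> search(List<String> words, String query) {
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (choseongKey(w).equals(query)) {
                hits.add(w);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> menu = List.of("된장찌개", "김치찌개", "된장국"); // hypothetical sample terms
        System.out.println(search(menu, "ㄷㅈㅉㄱ")); // prints [된장찌개]
    }
}
```

With this key, four typed Jamo are enough to find the word instead of ten.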
Hangul also has *dual chars* (letters made up of two simpler letters), such as
"ㄲ, ㄸ, ㅆ, ㅃ, ㅉ, ㅚ (ㅗ + ㅣ), ㅢ (ㅡ + ㅣ), ...".
For these reasons,
KoreanDecompose offers *5 options*.
ex) *된장찌개* => tokens [된장], [찌개]
*1) ORIGIN*
[된장], [찌개]
*2) SINGLECHOSEONG*
[ㄷㅈ], [ㅉㄱ]
*3) DUALCHOSEONG*
[ㄷㅈ], [ㅈㅈㄱ]
*4) SINGLEJAMO*
[ㄷㅚㄴㅈㅏㅇ], [ㅉㅣㄱㅐ]
*5) DUALJAMO*
[ㄷㅗㅣㄴㅈㅏㅇ], [ㅈㅈㅣㄱㅐ]
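The five options can be sketched as a single function over a token. This is a rough illustration, not the filter from the patch: the enum constants reuse the option names above, but `DecomposeModesSketch`, `apply`, and `splitDual` are hypothetical, and the table of dual chars (double consonants and compound vowels only, with consonant-cluster finals left unsplit) is my assumption about which letters get split.

```java
final class DecomposeModesSketch {
    enum Mode { ORIGIN, SINGLECHOSEONG, DUALCHOSEONG, SINGLEJAMO, DUALJAMO }

    // Compatibility Jamo listed in Unicode syllable-composition order.
    static final String CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";
    static final String JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";
    static final String JONG = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"; // index 0 = no final

    /** Splits a dual char into its two parts (assumed table); other Jamo pass through. */
    static String splitDual(char jamo) {
        switch (jamo) {
            case 'ㄲ': return "ㄱㄱ";
            case 'ㄸ': return "ㄷㄷ";
            case 'ㅃ': return "ㅂㅂ";
            case 'ㅆ': return "ㅅㅅ";
            case 'ㅉ': return "ㅈㅈ";
            case 'ㅘ': return "ㅗㅏ";
            case 'ㅙ': return "ㅗㅐ";
            case 'ㅚ': return "ㅗㅣ";
            case 'ㅝ': return "ㅜㅓ";
            case 'ㅞ': return "ㅜㅔ";
            case 'ㅟ': return "ㅜㅣ";
            case 'ㅢ': return "ㅡㅣ";
            default: return String.valueOf(jamo);
        }
    }

    static String apply(String token, Mode mode) {
        if (mode == Mode.ORIGIN) {
            return token; // leave the token untouched
        }
        boolean choOnly = mode == Mode.SINGLECHOSEONG || mode == Mode.DUALCHOSEONG;
        boolean dual = mode == Mode.DUALCHOSEONG || mode == Mode.DUALJAMO;
        StringBuilder jamo = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (c < 0xAC00 || c > 0xD7A3) {
                jamo.append(c); // not a precomposed syllable: keep as-is
                continue;
            }
            int idx = c - 0xAC00;
            jamo.append(CHO.charAt(idx / (21 * 28)));
            if (choOnly) {
                continue; // Choseong modes drop the vowel and the final
            }
            jamo.append(JUNG.charAt((idx / 28) % 21));
            if (idx % 28 != 0) {
                jamo.append(JONG.charAt(idx % 28));
            }
        }
        if (!dual) {
            return jamo.toString();
        }
        StringBuilder out = new StringBuilder();
        for (char j : jamo.toString().toCharArray()) {
            out.append(splitDual(j));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            System.out.println(m + ": [" + apply("된장", m) + "], [" + apply("찌개", m) + "]");
        }
    }
}
```

Run on the two tokens [된장] and [찌개], the sketch yields the same five outputs as listed above.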