Incheonkirin opened a new issue, #16241:
URL: https://github.com/apache/lucene/issues/16241

   KoreanTokenizer expects precomposed Hangul syllables. When the same Korean 
text is supplied in NFD form, modern Hangul syllables become conjoining jamo 
sequences and Nori falls back to UNKNOWN eojeol-sized tokens instead of 
dictionary morphemes.
   
   Minimal repro on current main: NFC "한국어 형태소를 분석합니다" produces 
한국어/한국/어/형태소/형태/소/를/분석/합니다/하/ᄇ니다. NFD of the same text produces one 
UNKNOWN token per whitespace-delimited Korean span, so NFD-sourced text indexed 
beside NFC Korean text silently fails to match NFC queries.
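
To make the input difference concrete, here is a minimal JDK-only sketch (no Lucene involved) showing what NFD does to the first word of the repro string. The class and method names `NfdHangulDemo`, `toNfd` and `containsConjoiningJamo` are made up for this illustration; only `java.text.Normalizer` is a real API.

```java
import java.text.Normalizer;

final class NfdHangulDemo {
    /** Decomposes text to NFD; precomposed Hangul syllables become conjoining jamo. */
    static String toNfd(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    /** Returns true if the text contains any modern conjoining jamo (L, V, or T). */
    static boolean containsConjoiningJamo(String text) {
        return text.chars().anyMatch(c ->
               (c >= 0x1100 && c <= 0x1112)    // leading consonants (L)
            || (c >= 0x1161 && c <= 0x1175)    // vowels (V)
            || (c >= 0x11A8 && c <= 0x11C2));  // trailing consonants (T)
    }

    public static void main(String[] args) {
        String nfc = "\uD55C\uAD6D\uC5B4";     // "한국어" as three precomposed syllables
        String nfd = toNfd(nfc);               // same text as L/V/T jamo sequences
        System.out.println(nfc.length() + " -> " + nfd.length()); // prints "3 -> 8"
        System.out.println(containsConjoiningJamo(nfc));          // prints "false"
        System.out.println(containsConjoiningJamo(nfd));          // prints "true"
    }
}
```

The two strings render identically in most fonts, which is why the mismatch tends to go unnoticed until a search misses.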
   
   This input shape is common in the wild: it is the same NFD text Korean users 
see as "jamo-separated" filenames when macOS-created archives are opened 
elsewhere. The same bytes reach indexes through filenames, extracted document 
text, and metadata pipelines.
   
   Proposed fix: an opt-in analysis-nori CharFilter that composes only modern 
Hangul conjoining jamo sequences (L U+1100..U+1112, V U+1161..U+1175, optional 
T U+11A8..U+11C2) into precomposed syllables before KoreanTokenizer, with 
offset correction back to the original input. It is deliberately narrower than 
NFC: a precomposed LV syllable followed by a trailing jamo is left unchanged 
(that shape does not occur in NFD output). For general Unicode normalization 
the ICU module's ICUNormalizer2CharFilter remains the right tool; this covers 
the common Korean-only case without adding the ICU dependency to nori 
deployments.
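
As a rough illustration of the composition rule only, the sketch below applies the standard Unicode Hangul arithmetic, S = 0xAC00 + (L * 21 + V) * 28 + T, to the jamo ranges listed above. It is not the proposed CharFilter: the class name `HangulJamoComposer` and its `compose` method are hypothetical, it works on a whole `String` rather than a `Reader`, and it does no offset correction (an actual filter would presumably build on Lucene's `BaseCharFilter` for that, which is an assumption here).

```java
final class HangulJamoComposer {
    private static final int S_BASE = 0xAC00;              // first precomposed syllable
    private static final int L_BASE = 0x1100, L_COUNT = 19;
    private static final int V_BASE = 0x1161, V_COUNT = 21;
    // T_BASE is one below the first trailing jamo, so index 0 means "no trailing jamo".
    private static final int T_BASE = 0x11A7, T_COUNT = 28;

    /** Composes modern L V [T] jamo sequences; every other character passes through. */
    static String compose(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            int l = c - L_BASE;
            if (l >= 0 && l < L_COUNT && i + 1 < in.length()) {
                int v = in.charAt(i + 1) - V_BASE;
                if (v >= 0 && v < V_COUNT) {
                    int t = 0;
                    int consumed = 2;
                    if (i + 2 < in.length()) {
                        int cand = in.charAt(i + 2) - T_BASE;
                        if (cand >= 1 && cand < T_COUNT) {  // U+11A8..U+11C2 only
                            t = cand;
                            consumed = 3;
                        }
                    }
                    out.append((char) (S_BASE + (l * V_COUNT + v) * T_COUNT + t));
                    i += consumed;
                    continue;
                }
            }
            // Not the start of an L V sequence: copy unchanged. This includes a
            // precomposed LV syllable followed by a trailing jamo, which NFC would merge.
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

For example, `compose` maps U+1112 U+1161 U+11AB to U+D55C (한), while U+AC00 U+11A8 is returned as-is even though NFC would compose it to U+AC01 (각).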
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

