[I] analysis-nori cannot analyze NFD-form Hangul as Korean morphemes [lucene]

via GitHub Wed, 10 Jun 2026 21:44:27 -0700


Incheonkirin opened a new issue, #16241:
URL: https://github.com/apache/lucene/issues/16241

KoreanTokenizer expects precomposed Hangul syllables. When the same Korean
text is supplied in NFD form, modern Hangul syllables become conjoining jamo
sequences and Nori falls back to UNKNOWN eojeol-sized tokens instead of
dictionary morphemes.

Minimal repro on current main: NFC "한국어 형태소를 분석합니다" produces
한국어/한국/어/형태소/형태/소/를/분석/합니다/하/ᄇ니다. NFD of the same text produces UNKNOWN tokens
for each whitespace-delimited Korean span, so NFD-sourced text indexed beside
NFC Korean text silently fails to match NFC queries.
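
To make the input shape concrete, here is a minimal sketch that uses only the JDK's java.text.Normalizer to show how precomposed syllables turn into conjoining jamo under NFD. It does not call Nori; the class name NfdHangulDemo and its helper are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class: shows the NFC vs. NFD shape of the same Korean text.
final class NfdHangulDemo {

    /** Returns the NFD form; each modern Hangul syllable becomes 2 or 3 conjoining jamo. */
    public static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** True if the string contains any precomposed Hangul syllable (U+AC00..U+D7A3). */
    public static boolean hasPrecomposedHangul(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xAC00 && c <= 0xD7A3) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String nfc = "\uD55C\uAD6D\uC5B4"; // 한국어, three precomposed syllables
        String nfd = toNfd(nfc);
        System.out.println(nfc.length()); // prints 3
        System.out.println(nfd.length()); // prints 8 (3 + 3 + 2 jamo)
        System.out.println(hasPrecomposedHangul(nfd)); // prints false
    }
}
```

Both strings render identically in most fonts, which is why the mismatch is easy to miss until a query fails to match.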

This input shape is common in the wild: it is the same NFD text Korean users
see as "jamo-separated" filenames when macOS-created archives are opened
elsewhere. The same bytes reach indexes through filenames, extracted document
text, and metadata pipelines.

Proposed fix: an opt-in analysis-nori CharFilter that composes only modern
Hangul conjoining jamo sequences (L U+1100..U+1112, V U+1161..U+1175, optional
T U+11A8..U+11C2) into precomposed syllables before KoreanTokenizer, with
offset correction back to the original input. It is deliberately narrower than
NFC: a precomposed LV syllable followed by a trailing jamo is left unchanged
(that shape does not occur in NFD output). For general Unicode normalization
the ICU module's ICUNormalizer2CharFilter remains the right tool; this covers
the common Korean-only case without adding the ICU dependency to nori
deployments.
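
The composition rule described above can be sketched as a plain string transform. This is only an illustration of the arithmetic, not the proposed CharFilter: it works on a whole String rather than a Reader, it omits offset correction entirely, and the class name HangulJamoComposer is a hypothetical name chosen here. The constants are the standard Unicode Hangul composition values (syllable base U+AC00, 21 vowels, 28 trailing slots).

```java
// Hypothetical sketch: composes modern L V [T] conjoining jamo into precomposed syllables.
// Not the proposed CharFilter; no Reader handling and no offset correction.
final class HangulJamoComposer {
    private static final int S_BASE = 0xAC00;                 // first precomposed syllable
    private static final int L_FIRST = 0x1100, L_LAST = 0x1112; // modern leading consonants
    private static final int V_FIRST = 0x1161, V_LAST = 0x1175; // modern vowels
    private static final int T_FIRST = 0x11A8, T_LAST = 0x11C2; // modern trailing consonants
    private static final int T_BASE = 0x11A7;                 // T index 0 means "no trailing"
    private static final int V_COUNT = 21, T_COUNT = 28;

    public static String compose(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            char l = in.charAt(i);
            // Only a leading jamo followed by a vowel jamo starts a composable sequence.
            if (l >= L_FIRST && l <= L_LAST && i + 1 < in.length()) {
                char v = in.charAt(i + 1);
                if (v >= V_FIRST && v <= V_LAST) {
                    int s = S_BASE + ((l - L_FIRST) * V_COUNT + (v - V_FIRST)) * T_COUNT;
                    i += 2;
                    // Consume an optional trailing jamo.
                    if (i < in.length()) {
                        char t = in.charAt(i);
                        if (t >= T_FIRST && t <= T_LAST) {
                            s += t - T_BASE;
                            i++;
                        }
                    }
                    out.append((char) s);
                    continue;
                }
            }
            // Everything else, including a precomposed LV syllable followed by a
            // trailing jamo, is copied through unchanged.
            out.append(l);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // ㅎ ㅏ ㄴ, ㄱ ㅜ ㄱ, ㅇ ㅓ as conjoining jamo
        String nfd = "\u1112\u1161\u11AB\u1100\u116E\u11A8\u110B\u1165";
        System.out.println(compose(nfd).equals("\uD55C\uAD6D\uC5B4")); // prints true
    }
}
```

Because the loop only starts composing at a leading jamo, the input U+AC00 followed by U+11A8 passes through as is, whereas NFC would merge it into U+AC01. A real filter would additionally need to record the cumulative length change at each composed syllable so that token offsets map back to the original input.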

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]