Incheonkirin opened a new pull request, #16242:
URL: https://github.com/apache/lucene/pull/16242

   Addresses #16241
   
   ## Summary
   
   Add an opt-in HangulCompositionCharFilter to analysis-nori. The filter 
composes modern Hangul conjoining-jamo sequences into precomposed Hangul 
syllables before KoreanTokenizer, so that NFD-form Korean text is analyzed the 
same way as the equivalent NFC text, while token offsets are still corrected 
back to the original input.
   
   The filter is intentionally narrow: it handles only modern L/V/optional-T 
conjoining jamo sequences and leaves compatibility jamo, archaic jamo, partial 
sequences, already-precomposed Korean text, and non-Hangul text unchanged. For 
general Unicode normalization the ICU module's ICUNormalizer2CharFilter remains 
the right tool; this covers the common Korean-only case without adding the ICU 
dependency to nori deployments.
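
For readers unfamiliar with the composition step, here is a minimal sketch of the arithmetic, assuming the standard Hangul syllable composition constants from the Unicode Standard. The class and method names are hypothetical and this is not the code in the PR; it only illustrates the scope described above (L + V + optional T is composed, everything else passes through).

```java
final class HangulComposeSketch {
  // Constants from the Unicode Standard's Hangul syllable composition algorithm.
  static final int S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7;
  static final int L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;

  static boolean isL(char c) { return c >= L_BASE && c < L_BASE + L_COUNT; }
  static boolean isV(char c) { return c >= V_BASE && c < V_BASE + V_COUNT; }
  // T_BASE is the "no trailing consonant" slot, so real trailing jamo start at T_BASE + 1.
  static boolean isT(char c) { return c > T_BASE && c < T_BASE + T_COUNT; }

  /** Composes modern L V [T] jamo runs; every other char is copied unchanged. */
  static String compose(String in) {
    StringBuilder out = new StringBuilder(in.length());
    int i = 0;
    while (i < in.length()) {
      char c = in.charAt(i);
      if (isL(c) && i + 1 < in.length() && isV(in.charAt(i + 1))) {
        int l = c - L_BASE;
        int v = in.charAt(i + 1) - V_BASE;
        int t = 0;
        int consumed = 2;
        if (i + 2 < in.length() && isT(in.charAt(i + 2))) {
          t = in.charAt(i + 2) - T_BASE;
          consumed = 3;
        }
        out.append((char) (S_BASE + (l * V_COUNT + v) * T_COUNT + t));
        i += consumed;
      } else {
        // Lone jamo, archaic jamo, compatibility jamo, precomposed syllables, non-Hangul.
        out.append(c);
        i++;
      }
    }
    return out.toString();
  }
}
```

Note that this sketch deliberately does less than full NFC: a precomposed LV syllable followed by a trailing jamo is left as two characters, matching the out-of-scope shape listed in the tests.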
   
   ## Tests
   
   - an NFD Korean sentence run through HangulCompositionCharFilter + 
KoreanTokenizer yields the same terms/POS as KoreanTokenizer on the NFC text
   - offsets from analyzed NFD text map back to the original NFD input
   - randomized modern Hangul NFD composition matches NFC
   - non-modern and partial jamo sequences unchanged
   - already-NFC and no-op inputs unchanged
   - precomposed LV syllable + trailing jamo passes through unchanged 
(out-of-scope shape)
   - factory registration
   - bogus factory arguments
   - random analyzer data
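
To illustrate what the offset test checks, the following is a rough sketch, under the assumption that each composed syllable should map back to the start of the jamo run it came from. The class name and its fields are hypothetical; in Lucene the corresponding hook is `CharFilter.correctOffset`, and how the PR actually records its corrections is not shown here.

```java
import java.util.Arrays;

final class HangulOffsetSketch {
  final String text;        // composed output
  final int[] inputOffset;  // inputOffset[k] = input index where output char k starts

  HangulOffsetSketch(String in) {
    StringBuilder out = new StringBuilder(in.length());
    int[] map = new int[in.length() + 1]; // output is never longer than input
    int i = 0;
    while (i < in.length()) {
      map[out.length()] = i;
      int n = sequenceLength(in, i);
      if (n == 0) {
        out.append(in.charAt(i)); // not a composable run: copy one char
        i++;
      } else {
        int l = in.charAt(i) - 0x1100;
        int v = in.charAt(i + 1) - 0x1161;
        int t = (n == 3) ? in.charAt(i + 2) - 0x11A7 : 0;
        out.append((char) (0xAC00 + (l * 21 + v) * 28 + t));
        i += n;
      }
    }
    map[out.length()] = in.length(); // end-of-text offset
    this.text = out.toString();
    this.inputOffset = Arrays.copyOf(map, out.length() + 1);
  }

  /** Returns 2 or 3 for a modern L V [T] run starting at i, or 0 if there is none. */
  static int sequenceLength(String s, int i) {
    if (i + 1 >= s.length()) return 0;
    char l = s.charAt(i), v = s.charAt(i + 1);
    if (l < 0x1100 || l > 0x1112 || v < 0x1161 || v > 0x1175) return 0;
    if (i + 2 < s.length()) {
      char t = s.charAt(i + 2);
      if (t >= 0x11A8 && t <= 0x11C2) return 3;
    }
    return 2;
  }

  /** Maps an offset in the composed text back to an offset in the original input. */
  int correctOffset(int outputOffset) {
    return inputOffset[outputOffset];
  }
}
```

With the six-jamo NFD form of 한글 as input, the composed text has two characters, and a token covering the first one at output offsets [0, 1) maps back to input offsets [0, 3).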
   
   ## Verification
   
   - ./gradlew :lucene:analysis:nori:tidy
   - ./gradlew :lucene:analysis:nori:check
   - ./gradlew check
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

