magibney commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to 
support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#issuecomment-565597955
 
 
   As discussed above, automatically adding external (top-level analysis chain) 
Unicode normalization to compensate for `assumeExternalUnicodeNormalization` 
seems too heavy-handed, and potentially too complex/brittle, to be introduced at 
this point as user-facing functionality.
   
   But for _tests_, where we know the specific Transliterators being used, it's 
fine to compensate automatically, especially since doing so lets us benefit 
from randomized testing of the `assumeExternalUnicodeNormalization` arg.
   
   Commit 3f4fcbb is the patch uploaded above by @msokolov. Commit feae28c 
exposes the package-private method 
`ICUTransformCharFilter.unicodeNormalizationType(String)` so that tests can 
properly compensate for randomly flipping the 
`assumeExternalUnicodeNormalization` arg, preserving the randomization of 
that arg that @msokolov introduced.
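
   To illustrate what "compensating" means here, the following is a minimal sketch, not the code from the PR. It uses the JDK's `java.text.Normalizer` as a stand-in for ICU normalization, and the class name, `STRIPPED_TRANSFORM`, and `withExternalNormalization` are all hypothetical. It assumes a transform whose leading NFD and trailing NFC steps have been stripped, so the caller has to apply them externally to get the same result.

```java
import java.text.Normalizer;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; none of these names come from the PR.
final class ExternalNormalizationSketch {

    // Stand-in for a transform whose leading NFD / trailing NFC steps were
    // stripped: it only removes nonspacing marks and relies on the caller
    // to have decomposed the input already.
    static final UnaryOperator<String> STRIPPED_TRANSFORM =
            s -> s.replaceAll("\\p{Mn}", "");

    // Re-applies the stripped normalization externally: NFD before the
    // transform, NFC after it.
    static String withExternalNormalization(String input, UnaryOperator<String> stripped) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String transformed = stripped.apply(decomposed);
        return Normalizer.normalize(transformed, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String input = "caf\u00e9"; // precomposed e-acute
        // Without external normalization the precomposed accent survives.
        System.out.println(STRIPPED_TRANSFORM.apply(input));
        // With external normalization the accent is decomposed, then removed.
        System.out.println(withExternalNormalization(input, STRIPPED_TRANSFORM));
    }
}
```

   The point of the sketch is only that a stripped transform gives different output unless the surrounding chain supplies the normalization, which is why the tests need to know which normalization type to add back.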
   
   In the course of work on commit feae28c, I realized that some 
Transliterators (e.g., from the tests, `Latin-Katakana`) have 
leading/trailing sub-transliterators that effectively perform Unicode 
normalization but are not detected/removed by the 
`assumeExternalUnicodeNormalization` rule modification. In the case of 
`Latin-Katakana`, this is because the sub-transliterators in question are 
themselves composite transliterators. I'm not sure what (if anything) to do 
about this, but commit 92209c5 calls attention to the issue by adding detection 
for this condition and providing a stub that could be used to log a warning, 
which should make things easier if this merits more attention in the future.
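
   The nesting problem can be sketched with a toy model. This is an assumption-laden illustration, not the detection code in commit 92209c5: the `Node` tree, the element IDs (`Outer`, `Inner`, `Step-A`, `Step-B`), and both method names are hypothetical, and it does not claim to reflect the actual structure of `Latin-Katakana`. It shows how a check that only inspects the top-level leading/trailing elements misses a normalizer sitting at the edge of a nested composite.

```java
import java.util.List;
import java.util.Set;

// Toy model only; it does not use or mirror the ICU Transliterator API.
final class NestedNormalizationSketch {

    static final Set<String> NORMALIZER_IDS = Set.of("NFC", "NFD", "NFKC", "NFKD");

    // A transliterator is modelled as a leaf (no children) or a composite.
    static final class Node {
        final String id;
        final List<Node> children;

        Node(String id, Node... children) {
            this.id = id;
            this.children = List.of(children);
        }

        boolean isComposite() { return !children.isEmpty(); }

        boolean isNormalizer() { return !isComposite() && NORMALIZER_IDS.contains(id); }
    }

    // First or last direct child of a composite node.
    static Node edgeChild(Node n, boolean leading) {
        return leading ? n.children.get(0) : n.children.get(n.children.size() - 1);
    }

    // Shallow check: only the top-level edge element is inspected.
    static boolean shallowEdgeNormalizer(Node root, boolean leading) {
        return root.isComposite() && edgeChild(root, leading).isNormalizer();
    }

    // Flags the condition described above: the top-level edge element is
    // itself composite, and the leaf at its outermost edge is a normalizer.
    static boolean hiddenEdgeNormalizer(Node root, boolean leading) {
        if (!root.isComposite()) return false;
        Node edge = edgeChild(root, leading);
        if (!edge.isComposite()) return false;
        while (edge.isComposite()) edge = edgeChild(edge, leading);
        return edge.isNormalizer();
    }

    public static void main(String[] args) {
        // Made-up structure: leading NFD hidden inside a nested composite.
        Node root = new Node("Outer",
                new Node("Inner", new Node("NFD"), new Node("Step-A")),
                new Node("Step-B"),
                new Node("NFC"));
        System.out.println("shallow leading: " + shallowEdgeNormalizer(root, true));
        System.out.println("hidden leading:  " + hiddenEdgeNormalizer(root, true));
    }
}
```

   In this model, a place to log a warning would be wherever `hiddenEdgeNormalizer` returns true; whether such normalizers should also be stripped remains the open question.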
