magibney commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#issuecomment-565597955

As discussed above, automatically adding external (top-level analysis chain) Unicode normalization to compensate for `assumeExternalUnicodeNormalization` seems too heavy-handed, and potentially too complex and brittle, to introduce at this point as user-facing functionality. But for _tests_, where we know the specific Transliterators being used, it's fine to compensate automatically, especially given that doing so lets us benefit from randomized testing of the `assumeExternalUnicodeNormalization` arg.

Commit 3f4fcbb is the patch uploaded above by @msokolov. Commit feae28c exposes the package-private method `ICUTransformCharFilter.unicodeNormalizationType(String)` so that tests can properly compensate for randomly flipping the `assumeExternalUnicodeNormalization` arg, preserving @msokolov's randomization of that arg.

While working on commit feae28c, I realized that some Transliterators (e.g., from the tests, `Latin-Katakana`) have leading/trailing sub-transliterators that effectively perform Unicode normalization but are not detected or removed by the `assumeExternalUnicodeNormalization` rule modification. In the case of `Latin-Katakana`, this is because the sub-transliterators in question are themselves composite transliterators. I'm not sure what (if anything) to do about this, but commit 92209c5 calls attention to the issue by adding detection for this condition, and provides a stub that could be used to log a warning, which should make things easier in the future if this merits more attention.
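To illustrate the nested-composite case described above, here is a minimal, self-contained sketch. It does not use the ICU4J API or the code in this PR: `Xlit` is a hypothetical stand-in for a transliterator (an ID plus ordered sub-transliterators), the IDs `Flat`, `Nested`, `InnerA`, `InnerB`, and `Core*` are made up, and the set of normalizer IDs is an assumption for illustration. It only shows why a check that inspects the direct first/last element can miss normalization that sits one level deeper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class NormalizationEdgeSketch {

    /** Hypothetical stand-in for a transliterator: an ID plus ordered sub-transliterators. */
    static final class Xlit {
        final String id;
        final List<Xlit> elements; // empty for a non-composite (leaf) transliterator

        Xlit(String id, Xlit... elements) {
            this.id = id;
            this.elements = Arrays.asList(elements);
        }

        boolean isComposite() {
            return !elements.isEmpty();
        }
    }

    // Assumption: normalizers are recognized purely by these IDs.
    private static final Set<String> NORMALIZER_IDS = Set.of("NFC", "NFD", "NFKC", "NFKD");

    /**
     * Returns the normalizer ID found at the leading or trailing edge of a composite.
     * With recurse=false only the direct edge element is inspected; with recurse=true
     * the search descends through nested composites along that edge.
     */
    static Optional<String> edgeNormalizer(Xlit t, boolean leading, boolean recurse) {
        if (!t.isComposite()) {
            return Optional.empty();
        }
        Xlit edge = edgeElement(t, leading);
        while (recurse && edge.isComposite()) {
            edge = edgeElement(edge, leading);
        }
        return NORMALIZER_IDS.contains(edge.id) ? Optional.of(edge.id) : Optional.empty();
    }

    private static Xlit edgeElement(Xlit t, boolean leading) {
        return t.elements.get(leading ? 0 : t.elements.size() - 1);
    }

    /** True if either edge has a normalizer that only the recursive search can see. */
    static boolean hasHiddenEdgeNormalization(Xlit t) {
        for (boolean leading : new boolean[] {true, false}) {
            boolean shallow = edgeNormalizer(t, leading, false).isPresent();
            boolean deep = edgeNormalizer(t, leading, true).isPresent();
            if (deep && !shallow) {
                return true; // a real implementation might log a warning here
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Normalizers are direct children: a shallow check finds them.
        Xlit flat = new Xlit("Flat", new Xlit("NFD"), new Xlit("Core"), new Xlit("NFC"));
        // Normalizers are wrapped inside composite children: a shallow check misses them.
        Xlit nested = new Xlit("Nested",
            new Xlit("InnerA", new Xlit("NFD"), new Xlit("CoreA")),
            new Xlit("InnerB", new Xlit("CoreB"), new Xlit("NFC")));
        System.out.println(hasHiddenEdgeNormalization(flat));   // false
        System.out.println(hasHiddenEdgeNormalization(nested)); // true
    }
}
```

Whether recursing like this is appropriate for real transliterators is a separate question; the sketch only models detection, not removal or compensation.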