magibney commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#issuecomment-563293051

Ah, I see. The refactoring looks good.

Separately (and with respect to the test failures induced by randomization): the `withoutUnicodeNormalization` randomization provides a good opportunity to make some documentation more explicit (and/or to build more nuance into the handling of the `assumeExternalUnicodeNormalization` flag). The `assumeExternalUnicodeNormalization` arg cannot simply be flipped in isolation without affecting behavior. My conception of this arg is as an "expert" option that can significantly improve performance, but that requires users to know (and externally compensate for, with specific ICUNormalizer2CharFilters in the top-level analysis chain) the specific type of input and output unicode normalization applied by a _particular_ transliterator (see the sketch at the end of this comment). It's true that this makes it possible for users to "shoot themselves in the foot".

One alternative would be to parse unicode normalization out of the Transliterator and effectively "rewrite" the analysis chain, pulling internal unicode normalization out to the top level as offset-aware ICUNormalizer2CharFilters. Pulling at this thread, though, leads to other potential considerations, such as:

1. I looked, and couldn't find any precedent for this type of under-the-hood analysis-chain rewriting.
2. Should we prune redundant unicode normalization induced by such rewriting, particularly when the redundancy is introduced by rewriting adjacent Transliterators?
3. The more we attempt to handle for the user, the more we encourage users to think that this is a one-size-fits-all drop-in replacement.

I actually _do_ think it would be possible to design it this way, but I'm not sure I'd be prepared to guarantee it, and my sense is that really doing it "right" -- flattening, respecting nested filters, propagating offsets -- could become quite complex.

Might we consider proceeding in two phases? One with `assumeExternalUnicodeNormalization` conceived as an expert option that doesn't attempt anything more than disabling transliterator i/o normalization, and a potential future step that would attempt rewriting composite (potentially hierarchical) Transliterators as top-level CharFilters (with all the complexity that might entail). For the first phase, that could mean leaving tests as they were before the patch that introduced the randomized disabling of normalization, and adding further tests that explicitly disable normalization and compensate with top-level normalization as appropriate for the associated transliterator.
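To make the "expert option" usage pattern concrete, here is a rough sketch. `ICUNormalizer2CharFilter` and `Normalizer2` are the existing offset-aware lucene-analysis-icu / ICU4J classes; the `ICUTransformCharFilter` constructor shape is hypothetical (whatever this PR settles on), and the NFD-in/NFC-out assumption for "Katakana-Hiragana" is purely illustrative:

```java
// Sketch only: the ICUTransformCharFilter constructor signature below is
// hypothetical; the surrounding normalization classes are the real
// lucene-analysis-icu / ICU4J APIs.
import java.io.Reader;

import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;
import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;

public class ExternalNormalizationSketch {
  static Reader buildChain(Reader in) {
    Transliterator t = Transliterator.getInstance("Katakana-Hiragana");
    // 1. Externally compensate for the transliterator's (now-disabled) input
    //    normalization. NFD is assumed here purely for illustration; the user
    //    must know what this particular transliterator actually requires.
    Reader normalizedInput =
        new ICUNormalizer2CharFilter(in, Normalizer2.getNFDInstance());
    // 2. Apply the transform with internal unicode normalization disabled
    //    (constructor/flag shape is assumed; see this PR for the actual API).
    Reader transformed = new ICUTransformCharFilter(
        normalizedInput, t, /* assumeExternalUnicodeNormalization= */ true);
    // 3. Externally compensate for the transliterator's (now-disabled) output
    //    normalization (again, NFC is illustrative).
    return new ICUNormalizer2CharFilter(transformed, Normalizer2.getNFCInstance());
  }
}
```

The point of the sketch: with `assumeExternalUnicodeNormalization=true`, steps 1 and 3 become the user's responsibility, and getting them wrong for a given transliterator is exactly the "foot-shooting" risk described above. A phase-two rewrite would instead derive steps 1 and 3 automatically from the Transliterator itself.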