magibney commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to 
support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#issuecomment-563293051
 
 
   Ah, I see. The refactoring looks good.
   
   Separately (and with respect to the test failures induced by randomization): 
the `withoutUnicodeNormalization` randomization provides a good opportunity to 
make some documentation more explicit (and/or to build more nuance/complexity 
into the handling of the `assumeExternalUnicodeNormalization` flag).
   
   The `assumeExternalUnicodeNormalization` arg cannot simply be flipped in 
isolation without affecting behavior. My conception of this arg is as an 
"expert" option that can significantly improve performance, but that requires 
users to know (and externally compensate for, with specific 
ICUNormalizer2CharFilters in the top-level analysis chain) the specific type of 
input and output Unicode normalization applied by a _particular_ transliterator.
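   
   To make "externally compensate" concrete, here is a rough sketch rather than 
a definitive usage pattern: the `ICUTransformCharFilter` constructor args shown 
are just my shorthand for whatever this PR ultimately exposes, and the choice of 
`Cyrillic-Latin` assumes (and would need verifying) a transliterator whose rules 
include internal normalization. The idea is a transform with 
`assumeExternalUnicodeNormalization` enabled, bracketed by explicit offset-aware 
normalization CharFilters at the top of the analysis chain:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;
import org.apache.lucene.analysis.icu.ICUTransformCharFilter; // class under review in this PR
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;

public class ExternallyNormalizedTransformSketch {
  static Analyzer analyzer() {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new ICUTokenizer());
      }

      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // compensate for the transliterator's disabled *input* normalization
        Reader nfd = new ICUNormalizer2CharFilter(reader, Normalizer2.getNFDInstance());
        // hypothetical signature: (Reader, Transliterator, assumeExternalUnicodeNormalization)
        Reader transformed = new ICUTransformCharFilter(
            nfd, Transliterator.getInstance("Cyrillic-Latin"), true);
        // compensate for the transliterator's disabled *output* normalization
        return new ICUNormalizer2CharFilter(transformed, Normalizer2.getNFCInstance());
      }
    };
  }
}
```

   The NFD/NFC pairing above is specific to the particular transliterator; a 
different transliterator might call for different (or no) compensation, which is 
exactly the expert knowledge this flag presupposes.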
   
   It's true that this makes it possible for users to "shoot themselves in the 
foot". One alternative would be to parse Unicode normalization out of the 
Transliterator and effectively "rewrite" the analysis chain, pulling internal 
Unicode normalization out to the top level as offset-aware 
ICUNormalizer2CharFilters (see the rough sketch after the list below for the 
kind of inspection this would involve). Pulling at this thread, though, leads to 
other potential considerations, such as:
   
   1. I looked and couldn't find any precedent for this type of under-the-hood 
analysis-chain rewriting.
   2. Should we prune redundant Unicode normalization induced by such rewriting, 
particularly when the redundancy is introduced by rewriting adjacent 
Transliterators?
   3. The more we attempt to handle for the user, the more we encourage users to 
think of this as a one-size-fits-all drop-in replacement. I actually _do_ think 
it would be possible to design it that way, but I'm not sure I'd be prepared to 
guarantee it, and my sense is that really doing it "right" -- flattening, 
respecting nested filters, propagating offsets -- could become quite complex.
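   
   For a sense of what such rewriting would even need to inspect, here is a 
rough illustration using only stock ICU4J (nothing from this PR): peeking at the 
top-level elements of a compound Transliterator to find normalization steps that 
could in principle be pulled out as top-level, offset-aware 
ICUNormalizer2CharFilters. Note that, as far as I can tell, `getElements()` only 
exposes top-level compound elements, not `::NFD`/`::NFC` rules nested inside a 
single rule-based element -- which is part of the flattening/nesting complexity 
mentioned in point 3 above.

```java
import com.ibm.icu.text.Transliterator;

public class CompoundTransliteratorInspection {
  public static void main(String[] args) {
    // classic accent-stripping compound ID, used here purely for illustration
    Transliterator t =
        Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove; NFC");
    for (Transliterator element : t.getElements()) {
      String id = element.getID();
      // element IDs may come back in canonical "Any-NFD"/"Any-NFC" form, hence endsWith
      boolean isNormalization = id.endsWith("NFC") || id.endsWith("NFD")
          || id.endsWith("NFKC") || id.endsWith("NFKD") || id.endsWith("FCD");
      if (isNormalization) {
        // candidate to be "pulled out" as a top-level ICUNormalizer2CharFilter
        System.out.println("normalization element: " + id);
      } else {
        System.out.println("other element: " + id);
      }
    }
  }
}
```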
   
   Might we consider proceeding in two phases: a first phase with 
`assumeExternalUnicodeNormalization` conceived as an expert option that doesn't 
attempt anything more than disabling transliterator i/o normalization, and a 
potential future phase that would attempt rewriting composite (potentially 
hierarchical) Transliterators as top-level CharFilters (with all the complexity 
that might entail)? For the first phase, that could mean leaving tests as they 
were before the patch that introduced the randomized disabling of normalization, 
and adding tests that explicitly disable normalization and compensate with 
top-level normalization as appropriate for the associated transliterator.
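   
   To be concrete about what those additional tests might look like, here is a 
hedged sketch: the `ICUTransformCharFilter` constructors shown are only my 
shorthand for this PR's actual API, and `Cyrillic-Latin` is assumed (and would 
need to be verified) to be a transliterator whose rules include internal 
normalization. The test explicitly disables internal normalization and checks 
that top-level compensation reproduces the default behavior:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;
import org.apache.lucene.analysis.icu.ICUTransformCharFilter; // class under review in this PR

import org.junit.Test;
import static org.junit.Assert.assertEquals;

import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;

public class TestExternallyNormalizedTransform {

  @Test
  public void testExplicitCompensationMatchesDefault() throws IOException {
    String input = "Мо́сква";  // Cyrillic with a combining acute accent
    Transliterator translit = Transliterator.getInstance("Cyrillic-Latin");

    // default behavior: the transliterator applies its own internal normalization
    // (hypothetical two-arg constructor)
    Reader withInternal = new ICUTransformCharFilter(new StringReader(input), translit);

    // expert mode: internal normalization disabled, compensated explicitly
    // at the top level (hypothetical three-arg constructor taking the flag)
    Reader compensated = new ICUNormalizer2CharFilter(
        new ICUTransformCharFilter(
            new ICUNormalizer2CharFilter(
                new StringReader(input), Normalizer2.getNFDInstance()),
            translit, /* assumeExternalUnicodeNormalization */ true),
        Normalizer2.getNFCInstance());

    assertEquals(readFully(withInternal), readFully(compensated));
  }

  private static String readFully(Reader r) throws IOException {
    StringBuilder sb = new StringBuilder();
    for (int c = r.read(); c != -1; c = r.read()) {
      sb.append((char) c);
    }
    return sb.toString();
  }
}
```

   In practice this would presumably live alongside the existing tests (e.g., 
extending BaseTokenStreamTestCase), but the shape would be the same.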
