Re: [I] Remove Telugu normalization of vu వు to ma మ from IndicNormalizer [lucene]

via GitHub Fri, 23 May 2025 02:58:05 -0700


praveen-d291 commented on issue #14659:
URL: https://github.com/apache/lucene/issues/14659#issuecomment-2903917694


   Hey @rmuir,
   
   Thanks for the explanation. I've been thinking about the TeluguAnalyzer's 
default behavior, and I believe we have a significant hidden issue. The 
analyzer bundles IndicNormalizationFilter, which implicitly converts వు -> మ. 
This conflation isn't documented anywhere within TeluguAnalyzer, so users won't 
realize it's happening. Even the link in IndicNormalizer.java 
(http://ldc.upenn.edu/myl/ldc/IndianScriptsUnicode.html) is no longer 
accessible. This behavior is going to confuse any native speaker.
   
   The bundled TeluguAnalyzer appears to assume that most users will run 
Lucene on text written in legacy fonts, extracted from PDFs, etc. That 
assumption may no longer hold, given the current state of Unicode support 
for Telugu. I have two options in mind.
   
   Option 1: Fix the Default (My Preference)
   I'd propose adding a boolean option to the TeluguAnalyzer constructor to 
control IndicNormalizationFilter inclusion, and make its default false. This 
would make TeluguAnalyzer precise right out of the box for modern documents. 
Users with older, less-formatted text could still explicitly enable it. I 
believe this is a necessary correction for linguistic accuracy, and we should 
also explicitly document this conversion.
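
   To make the effect of the proposed flag concrete, here is a toy sketch in 
plain Java. It does not use Lucene at all: the class name, the flag name, and 
the plain string replacement are assumptions made for illustration, and the 
real IndicNormalizer logic is more involved. It only shows how two distinct 
terms collapse into one when the vu -> ma mapping is applied, and stay distinct 
when it is off by default.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: this is NOT Lucene's TeluguAnalyzer or IndicNormalizer.
final class ConflationSketch {
    // Telugu VA (U+0C35) + VOWEL SIGN U (U+0C41), i.e. "vu".
    static final String VU = "\u0C35\u0C41";
    // Telugu MA (U+0C2E), i.e. "ma".
    static final String MA = "\u0C2E";

    // Hypothetical flag mirroring the proposed constructor option.
    private final boolean applyIndicNormalization;

    // Proposed default: normalization off.
    ConflationSketch() {
        this(false);
    }

    ConflationSketch(boolean applyIndicNormalization) {
        this.applyIndicNormalization = applyIndicNormalization;
    }

    // Simplified stand-in for the conflation: rewrite "vu" as "ma" when enabled.
    String normalize(String term) {
        return applyIndicNormalization ? term.replace(VU, MA) : term;
    }

    // Distinct terms that would end up in the index after normalization.
    Set<String> indexTerms(List<String> terms) {
        Set<String> out = new LinkedHashSet<>();
        for (String term : terms) {
            out.add(normalize(term));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = List.of(VU, MA);
        // Default (flag off): both terms survive, prints 2.
        System.out.println(new ConflationSketch().indexTerms(terms).size());
        // Flag on: "vu" is rewritten to "ma", so they collide, prints 1.
        System.out.println(new ConflationSketch(true).indexTerms(terms).size());
    }
}
```

   With the flag off, a search for one term cannot silently match the other; 
with it on, the two become indistinguishable at index and query time.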
   
   Option 2: Document the behavior in TeluguAnalyzer
   Alternatively, we could document this specific behavior in the 
TeluguAnalyzer docs, explaining the వు to మ mapping and how to build a custom 
analyzer to avoid it.
   
   Option 1 feels like the right long-term fix for the default user experience. 
What do you think? I can raise a PR once we agree on the approach.
   
   As a native Telugu speaker who loves Lucene, I'm keen to help out!
   
   cc @Trey314159 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

