praveen-d291 commented on issue #14659: URL: https://github.com/apache/lucene/issues/14659#issuecomment-2903917694
Hey @rmuir, thanks for the explanation.

I've been thinking about the TeluguAnalyzer's default behavior, and I believe we have a significant hidden issue. The analyzer bundles IndicNormalizationFilter, which implicitly converts వు -> మ. This conflation isn't documented anywhere in TeluguAnalyzer, so users won't realize it's happening. Even the link in IndicNormalizer.java (http://ldc.upenn.edu/myl/ldc/IndianScriptsUnicode.html) is no longer accessible. This behavior will confuse any native speaker.

The bundled TeluguAnalyzer assumes that most users will run Lucene on text written in legacy fonts, extracted from PDFs, and so on. That may no longer be true now that Unicode supports Telugu as it does.

I have two options in mind.

Option 1: Fix the Default (My Preference)
I'd propose adding a boolean option to the TeluguAnalyzer constructor to control whether IndicNormalizationFilter is included, with a default of false. This would make TeluguAnalyzer precise out of the box for modern documents. Users with older, less cleanly formatted text could still enable it explicitly. I believe this is a necessary correction for linguistic accuracy, and we should also document the conversion explicitly.

Option 2: Document the behavior in TeluguAnalyzer
Alternatively, we could document this specific behavior in the TeluguAnalyzer docs, explaining the వు to మ mapping and how to build a custom analyzer to avoid it.

Option 1 feels like the right long-term fix for the default user experience. What do you think? I can raise a PR once we agree on an approach. As a native Telugu speaker who loves Lucene, I'm keen to help out!

cc @Trey314159

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
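To make the effect of the constructor flag proposed in Option 1 concrete, here is a minimal toy sketch in plain Java. It is not Lucene code: `ToyTeluguAnalyzer` is a hypothetical stand-in, whitespace splitting replaces real tokenization, and the normalization step models only the single వు -> మ conflation described in the comment, not the full behavior of IndicNormalizationFilter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy stand-in for the proposal; NOT part of Lucene. */
final class ToyTeluguAnalyzer {
    // Assumption: only the one mapping named in the comment is modeled.
    private static final String VU = "\u0C35\u0C41"; // వు (VA + vowel sign U)
    private static final String MA = "\u0C2E";       // మ (MA)

    private final boolean indicNormalization;

    /** Proposed default from Option 1: normalization switched off. */
    ToyTeluguAnalyzer() {
        this(false);
    }

    /** Callers with legacy-font or PDF-extracted text can opt in explicitly. */
    ToyTeluguAnalyzer(boolean indicNormalization) {
        this.indicNormalization = indicNormalization;
    }

    /** Splits on whitespace and optionally applies the toy conflation. */
    List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields no tokens
            }
            out.add(indicNormalization ? token.replace(VU, MA) : token);
        }
        return out;
    }
}
```

With the flag off, a token containing వు is left untouched; with it on, the same token is rewritten to contain మ, which is the conflation a native speaker would find surprising.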
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org