[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693545#action_12693545 ]
DM Smith commented on LUCENE-1581:
----------------------------------

This is a somewhat larger problem. It pertains to upper casing as well.

I don't remember exactly, but I seem to remember that Java is behind with
regard to the Unicode spec and Locale support (e.g. it does not include fa,
Farsi). I find that ICU4J keeps current with the spec.

I don't remember which way it goes (maybe it's both), but in some Locales
some characters don't have a corresponding upper or lower case.

I'm not sure, but I think efficiency depends on how the text is normalized
in Unicode (e.g. NFC, NFKC, NFD, or NFKD). The different forms might produce
different performance results.

(It is a different issue, but it is critical that search requests perform
the same Unicode normalization as the index. As Lucene does not have these
normalization filters, I find I have to do this externally in my own filters
using ICU.)

(Again a different issue: another related kind of folding is that of base 10
number shaping.)

Regarding:
bq. I see no easy way (and efficient) to fix it. Suppose that we allow LowerCaseFilter to accept Locale. What would it do with it?

I think that we need Upper and Lower case filters that operate on the token
as a whole, using the string-level method to do case conversion.

What I'd like to see is a pluggable way for Lucene to handle ICU, insofar as
it does Locale-specific things such as this. For example, a base class
UpperCaseFolder could provide the Java implementation but accept an
alternate implementation, perhaps by reflection.

> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
> Key: LUCENE-1581
> URL: https://issues.apache.org/jira/browse/LUCENE-1581
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Digy
>
> // Since I am a .NET programmer, the sample code is in C#, but I don't think
> that makes it hard to understand.
> //
> Assume an input text like "İ" and an analyzer like the one below:
> {code}
>     public class SomeAnalyzer : Analyzer
>     {
>         public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
>         {
>             TokenStream t = new SomeTokenizer(reader);
>             t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>             t = new LowerCaseFilter(t);
>             return t;
>         }
>     }
> {code}
>
> ASCIIFoldingFilter will return "I", and then LowerCaseFilter will return
>     "i" (if the locale is "en-US")
>     or
>     "ı" (if the locale is "tr-TR"), which means this token would have to be
>     passed to another instance of ASCIIFoldingFilter.
>
> So calling LowerCaseFilter before ASCIIFoldingFilter would be one solution,
> but a better approach would be to add a new constructor to LowerCaseFilter
> that forces it to use a specific locale.
> {code}
>     public sealed class LowerCaseFilter : TokenFilter
>     {
>         /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>
>         public LowerCaseFilter(TokenStream input) : base(input)
>         {
>         }
>
>         /* +++ */ public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
>         /* +++ */ {
>         /* +++ */     this.CultureInfo = CultureInfo;
>         /* +++ */ }
>
>         public override Token Next(Token result)
>         {
>             result = Input.Next(result);
>             if (result != null)
>             {
>                 char[] buffer = result.TermBuffer();
>                 int length = result.termLength;
>                 for (int i = 0; i < length; i++)
>         /* +++ */     buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
>                 return result;
>             }
>             else
>                 return null;
>         }
>     }
> {code}
> DIGY
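
The difference between per-character and whole-token case conversion that DM Smith raises in the comment above can be illustrated with plain JDK calls. The following is only a sketch: the class LocaleLowerCaseSketch and its two helper methods are hypothetical names made up for this illustration and are not part of Lucene or of any proposed patch. It assumes a JDK whose String.toLowerCase(Locale) applies the Unicode special-casing rules for Turkish.

```java
import java.util.Locale;

/** Hypothetical illustration only; not Lucene code. */
final class LocaleLowerCaseSketch {

    /**
     * Lowers each char on its own, the way a char-by-char filter loop does.
     * Character.toLowerCase(char) takes no Locale and always maps one char
     * to one char.
     */
    static String lowerPerChar(String term) {
        char[] buffer = term.toCharArray();
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = Character.toLowerCase(buffer[i]);
        }
        return new String(buffer);
    }

    /**
     * Lowers the token as a whole with an explicit Locale. The result can
     * differ from the per-char result and can even change length.
     */
    static String lowerWholeToken(String term, Locale locale) {
        return term.toLowerCase(locale);
    }

    /** Formats a string as a list of code points, e.g. "U+0069 U+0307". */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        String[] terms = { "I", "\u0130" }; // LATIN CAPITAL I, and I WITH DOT ABOVE
        for (String term : terms) {
            System.out.println(codePoints(term)
                + " per-char: " + codePoints(lowerPerChar(term))
                + " | en: " + codePoints(lowerWholeToken(term, Locale.ENGLISH))
                + " | tr: " + codePoints(lowerWholeToken(term, turkish)));
        }
    }
}
```

With the Turkish locale, "I" lowers to the dotless "ı" (U+0131), while the per-character loop always yields "i". For "İ" (U+0130), the string-level method in a non-Turkish locale returns two chars ("i" followed by the combining dot U+0307), which a filter that rewrites the term buffer in place with a fixed length could not hold. That is one reason a locale-aware filter would probably have to work on the token as a whole.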