[ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693540#action_12693540 ]
Shai Erera commented on LUCENE-1581:
------------------------------------

From the javadocs (http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#toLowerCase(char)):
_In general, String.toLowerCase() should be used to map characters to lowercase. String case mapping methods have several benefits over Character case mapping methods. String case mapping methods can perform locale-sensitive mappings, context-sensitive mappings, and 1:M character mappings, whereas the Character case mapping methods cannot._

So I agree this is a problem, but I see no easy (and efficient) way to fix it. Suppose that we allow LowerCaseFilter to accept a Locale. What would it do with it?

We could add to LowerCaseFilter a Map<Locale, char[65536]> and allow one to pass in the Locale. If it's not null, and there's an entry in the map, look up every character the filter receives. The lookup will be quite fast, as the character will serve as the index into the array (notice that we cover only 2-byte characters though), and if the entry is \uFFFF we can assume there's no special handling and call Character.toLowerCase.

That is very fragile though, as it's not easy to cover all the special-case characters. Also, every time (including this one) we find a special character that was not handled properly by the filter, fixing it would break back-compat, no?

BTW, when characters are uppercase, I don't think we have a problem, as they will always be lowercased to a single character (even if it's the wrong one, it will be consistent in indexing and search). The problem comes with the lowercase characters. The character \u0131 (lowercase I in Turkish) is lowercased to \u0131, while its uppercase version (I) is lowercased to 'i'. Therefore there is a mismatch, and we'll fail if the user enters a lowercase query (as they often do).

> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-1581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1581
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Digy
>
> //Since I am a .Net programmer, sample code will be in C#, but I don't think that it would be a problem to understand it.
> //
> Assume an input text like "İ" and an analyzer like below
> {code}
> public class SomeAnalyzer : Analyzer
> {
>     public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
>     {
>         TokenStream t = new SomeTokenizer(reader);
>         t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>         t = new LowerCaseFilter(t);
>         return t;
>     }
> }
> {code}
>
> ASCIIFoldingFilter will return "I" and after that, LowerCaseFilter will return
>     "i" (if locale is "en-US")
>         or
>     "ı" (if locale is "tr-TR") (that means this token should be input to another instance of ASCIIFoldingFilter)
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, but a better approach can be adding
> a new constructor to LowerCaseFilter and forcing it to use a specific locale.
> {code}
> public sealed class LowerCaseFilter : TokenFilter
> {
>     /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>
>     // Parameter renamed from "in", which is a reserved word in C#.
>     public LowerCaseFilter(TokenStream input) : base(input)
>     {
>     }
>
>     /* +++ */ public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
>     /* +++ */ {
>     /* +++ */     this.CultureInfo = CultureInfo;
>     /* +++ */ }
>
>     public override Token Next(Token result)
>     {
>         result = Input.Next(result);
>         if (result != null)
>         {
>             char[] buffer = result.TermBuffer();
>             int length = result.termLength;
>             for (int i = 0; i < length; i++)
>     /* +++ */     buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
>             return result;
>         }
>         else
>             return null;
>     }
> }
> {code}
> DIGY

--
This message is automatically generated by JIRA.
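
A minimal sketch, in Java, of the per-locale lookup table idea from Shai's comment above. The class name LocaleLowerCaser, its methods, and the two Turkish table entries are assumptions made up for illustration; none of this is Lucene API. The table is deliberately incomplete, which is the fragility the comment points out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not part of Lucene): locale-aware single-char lowercasing. */
class LocaleLowerCaser {
    /** Sentinel: no special handling, fall back to Character.toLowerCase. */
    private static final char NO_MAPPING = '\uFFFF';

    /** One 65536-entry table per locale, indexed by the char value (BMP only). */
    private static final Map<Locale, char[]> TABLES = new HashMap<>();

    static {
        // Assumed, incomplete Turkish table: only the two dotted/dotless I cases.
        char[] turkish = new char[65536];
        Arrays.fill(turkish, NO_MAPPING);
        turkish['I'] = '\u0131';   // capital I lowercases to dotless i
        turkish['\u0130'] = 'i';   // capital I with dot above lowercases to i
        TABLES.put(Locale.forLanguageTag("tr-TR"), turkish);
    }

    /** Lowercases one char, consulting the locale's table first if there is one. */
    static char toLowerCase(char c, Locale locale) {
        char[] table = (locale == null) ? null : TABLES.get(locale);
        if (table != null && table[c] != NO_MAPPING) {
            return table[c];
        }
        return Character.toLowerCase(c);
    }

    /** Lowercases the first length chars of buffer in place, as a filter would. */
    static void lowerCaseInPlace(char[] buffer, int length, Locale locale) {
        for (int i = 0; i < length; i++) {
            buffer[i] = toLowerCase(buffer[i], locale);
        }
    }
}
```

With the Turkish table, 'I' maps to \u0131, so an indexed "I" and a lowercase query "ı" end up as the same term. With a null locale, or a locale that has no table, the call behaves exactly like Character.toLowerCase. The sketch cannot express the 1:M mappings the javadoc mentions, since each char maps to exactly one char.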