[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

Shai Erera (JIRA) Sun, 29 Mar 2009 03:53:13 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693540#action_12693540
 ]


Shai Erera commented on LUCENE-1581:
------------------------------------

From the javadocs
(http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#toLowerCase(char)):

_In general, String.toLowerCase() should be used to map characters to 
lowercase. String case mapping methods have several benefits over Character 
case mapping methods. String case mapping methods can perform locale-sensitive 
mappings, context-sensitive mappings, and 1:M character mappings, whereas the 
Character case mapping methods cannot._
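
To make the difference described in the javadoc concrete, here is a minimal
sketch in plain Java. The class and method names (CaseMappingDemo,
lowerWithLocale, lowerPerChar) are made up for illustration and are not part of
Lucene; only String.toLowerCase(Locale) and Character.toLowerCase(char) are
real JDK calls.

```java
import java.util.Locale;

// Hypothetical demo class, not part of Lucene.
final class CaseMappingDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Locale-sensitive, possibly 1:M mapping done on the whole string.
    static String lowerWithLocale(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    // Per-character mapping, one char in and one char out, no locale.
    static String lowerPerChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    public static void main(String[] args) {
        // 'I' under the Turkish locale versus the locale-free Character mapping.
        System.out.println(lowerWithLocale("I", TURKISH));
        System.out.println(lowerPerChar("I"));
        // \u0130 (capital I with dot above) shows the 1:M case outside Turkish.
        System.out.println(lowerWithLocale("\u0130", Locale.ENGLISH).length());
        System.out.println(lowerPerChar("\u0130").length());
    }
}
```

With the Turkish locale, "I" maps to \u0131, while the per-character method
maps it to 'i'. For \u0130 the String method returns two chars under a
non-Turkish locale, which the Character method cannot express.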

So I agree this is a problem, but I see no easy and efficient way to fix it.
Suppose that we allow LowerCaseFilter to accept a Locale. What would it do with
it? We could add to LowerCaseFilter a Map<Locale, char[]>, holding one
65536-entry array per Locale, and allow one to pass in the Locale. If it's not
null, and there's an entry in the map, we look up every character the filter
receives. The lookup will be quite fast, as the character serves as the index
into the array (notice that this covers only 2-byte characters though), and if
the entry is \uFFFF we can assume there's no special handling and call
Character.toLowerCase.
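
A rough sketch of that lookup, as standalone Java rather than an actual
TokenFilter. Everything here is an assumption for illustration: the class name
LocaleLowerCaseTable, the NO_OVERRIDE sentinel name, and the two Turkish
entries, which are nowhere near a complete table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class LocaleLowerCaseTable {
    // Sentinel meaning "no special handling, fall back to Character.toLowerCase".
    private static final char NO_OVERRIDE = '\uFFFF';
    private static final Map<Locale, char[]> TABLES = new HashMap<>();

    static {
        // Illustrative Turkish overrides only; a real table would need more care.
        char[] turkish = new char[65536];
        Arrays.fill(turkish, NO_OVERRIDE);
        turkish['I'] = '\u0131';   // I -> dotless i
        turkish['\u0130'] = 'i';   // dotted capital I -> i
        TABLES.put(Locale.forLanguageTag("tr"), turkish);
    }

    // Lowercases the first 'length' chars of 'buffer' in place.
    static void lowerCase(char[] buffer, int length, Locale locale) {
        char[] table = (locale == null) ? null : TABLES.get(locale);
        for (int i = 0; i < length; i++) {
            char c = buffer[i];
            // The char itself is the index into the per-locale array.
            char mapped = (table == null) ? NO_OVERRIDE : table[c];
            buffer[i] = (mapped == NO_OVERRIDE) ? Character.toLowerCase(c) : mapped;
        }
    }

    // Convenience wrapper for trying the table out on a String.
    static String lowerCase(String s, Locale locale) {
        char[] buf = s.toCharArray();
        lowerCase(buf, buf.length, locale);
        return new String(buf);
    }
}
```

Each table costs 128 KB (65536 two-byte chars), and the map is keyed on the
exact Locale, so "tr" and "tr-TR" would need separate entries or some
normalization before the lookup.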

That is very fragile though, as it's not easy to cover all the special-case
characters. Also, every time we find a special character that the filter did
not handle properly (this one included), fixing it would break back-compat,
no?

BTW, when the input characters are uppercase, I don't think we have a problem,
as each one is always lowercased to a single character (even if it's the wrong
one, it is consistent between indexing and search). The problem comes with
lowercase characters. The character \u0131 (Turkish lowercase dotless i) is
lowercased to itself, \u0131, while its uppercase version (I) is lowercased to
'i'. Therefore there is a mismatch, and we'll fail to match if the user enters
a lowercase query (as users often do).
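
The mismatch can be reproduced without Lucene at all. The following sketch
assumes the same per-character lowercasing is applied at index time and at
query time; DotlessIMismatch, normalize and matches are hypothetical names, and
the sample words are only test input.

```java
// Hypothetical demo class, not part of Lucene.
final class DotlessIMismatch {
    // Per-character, locale-free lowercasing, as described above.
    static String normalize(String term) {
        char[] buf = term.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toLowerCase(buf[i]);
        }
        return new String(buf);
    }

    // True if the indexed text and the query normalize to the same term.
    static boolean matches(String indexedText, String queryText) {
        return normalize(indexedText).equals(normalize(queryText));
    }

    public static void main(String[] args) {
        // Uppercase on both sides: consistent, so the terms match.
        System.out.println(matches("ISIK", "ISIK"));
        // Indexed uppercase, queried with lowercase dotless i: no match.
        System.out.println(matches("ISIK", "\u0131s\u0131k"));
    }
}
```

The indexed "ISIK" becomes "isik", while the query keeps its \u0131
characters, so the two terms differ.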

> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-1581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1581
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Digy
>
> //Since I am a .Net programmer, the sample code will be in C#, but I don't
> think that it will be a problem to understand it.
> //
> Assume an input text like "İ" and an analyzer like the one below
> {code}
>       public class SomeAnalyzer : Analyzer
>       {
>               public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
>               {
>                       TokenStream t = new SomeTokenizer(reader);
>                       t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>                       t = new LowerCaseFilter(t);
>                       return t;
>               }
>         
>       }
> {code}
>       
> ASCIIFoldingFilter will return "I", and afterwards LowerCaseFilter will return
>       "i" (if the locale is "en-US")
>       or
>       "ı" (if the locale is "tr-TR"), which means this token would have to be
> input to another instance of ASCIIFoldingFilter.
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution,
> but a better approach would be to add
> a new constructor to LowerCaseFilter and force it to use a specific locale.
> {code}
>     public sealed class LowerCaseFilter : TokenFilter
>     {
>         /* +++ */System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>         // "in" is a reserved word in C#, so the parameter is named "input"
>         public LowerCaseFilter(TokenStream input) : base(input)
>         {
>         }
>         /* +++ */  public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
>         /* +++ */  {
>         /* +++ */      this.CultureInfo = CultureInfo;
>         /* +++ */  }
>               
>         public override Token Next(Token result)
>         {
>             result = Input.Next(result);
>             if (result != null)
>             {
>                 char[] buffer = result.TermBuffer();
>                 int length = result.termLength;
>                 for (int i = 0; i < length; i++)
>                     /* +++ */ buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
>                 return result;
>             }
>             else
>                 return null;
>         }
>     }
> {code}
> DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
