[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

DM Smith (JIRA) Tue, 01 Dec 2009 11:31:45 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784358#action_12784358
 ]


DM Smith commented on LUCENE-1581:
----------------------------------

bq. ultimately I still think case folding is the right way for this across the 
board, but you can't do that without a 3rd party library.
I think that any project that is serious about texts in various non-latin 
languages will already be using ICU4J. The significant non-latin language work 
is in contrib. 3-rd party libraries are not that big a deal there.

> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-1581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1581
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Digy
>         Attachments: TestTurkishCollation.java
>
>
> //Since I am a .Net programmer, Sample codes will be in c# but I don't think 
> that it would be a problem to understand them.
> //
> Assume an input text like "İ" and and analyzer like below
> {code}
>       public class SomeAnalyzer : Analyzer
>       {
>               public override TokenStream TokenStream(string fieldName, 
> System.IO.TextReader reader)
>               {
>                       TokenStream t = new SomeTokenizer(reader);
>                       t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>                       t = new LowerCaseFilter(t);
>                       return t;
>               }
>         
>       }
> {code}
>       
> ASCIIFoldingFilter will return "I" and after, LowerCaseFilter will return
>       "i" (if locale is "en-US") 
>       or 
>       "ı' if(locale is "tr-TR") (that means,this token should be input to 
> another instance of ASCIIFoldingFilter)
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, 
> but a better approach can be adding
> a new constructor to LowerCaseFilter and forcing it to use a specific locale.
> {code}
>     public sealed class LowerCaseFilter : TokenFilter
>     {
>         /* +++ */System.Globalization.CultureInfo CultureInfo = 
> System.Globalization.CultureInfo.CurrentCulture;
>         public LowerCaseFilter(TokenStream in) : base(in)
>         {
>         }
>         /* +++ */  public LowerCaseFilter(TokenStream in, 
> System.Globalization.CultureInfo CultureInfo) : base(in)
>         /* +++ */  {
>         /* +++ */      this.CultureInfo = CultureInfo;
>         /* +++ */  }
>               
>         public override Token Next(Token result)
>         {
>             result = Input.Next(result);
>             if (result != null)
>             {
>                 char[] buffer = result.TermBuffer();
>                 int length = result.termLength;
>                 for (int i = 0; i < length; i++)
>                     /* +++ */ buffer[i] = 
> System.Char.ToLower(buffer[i],CultureInfo);
>                 return result;
>             }
>             else
>                 return null;
>         }
>     }
> {code}
> DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

Reply via email to