[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

Shai Erera (JIRA) Sun, 29 Mar 2009 08:10:14 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693568#action_12693568
 ]


Shai Erera commented on LUCENE-1581:
------------------------------------

bq. What I'd like to see is a pluggable way for Lucene to handle ICU, insofar 
as it does locale-specific things such as this: for example, a base class 
UpperCaseFolder that provides the Java implementation but can take an 
alternate implementation, perhaps by reflection.

Why do this? What prevents you from creating such a filter in your own 
application? Lucene does not ship many analyzers, nor a single configurable 
Analyzer meant for everyone. So why should core Lucene provide a filter that 
uses ICU4J? Such a module could of course live in contrib, as the analyzers 
for other languages do ...

BTW, I have some experience with ICU4J, and it had several performance issues, 
such as large consecutive array allocations. It also operates on Strings and 
lacks the efficient API Lucene uses in tokenization (i.e., working on 
char[]).
However, I worked with it a long time ago, and things may have changed 
since.
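
To make the char[] point concrete, here is a minimal Java sketch of 
locale-aware lower-casing done in place on a term buffer. The class 
LocaleLowerCase and its toLowerCase method are hypothetical names, not part of 
Lucene or ICU4J. Since Character.toLowerCase takes no Locale, the sketch 
special-cases only the Turkish/Azeri dotted and dotless I and ignores other 
locale-specific rules (e.g. Lithuanian), so treat it as an illustration rather 
than a complete solution.

```java
import java.util.Locale;

public class LocaleLowerCase {
    // Hypothetical helper (not a Lucene API): lower-cases buffer[0..length) in place.
    // Assumption: only the Turkish/Azeri I rules are treated as locale-specific.
    public static void toLowerCase(char[] buffer, int length, Locale locale) {
        String lang = locale.getLanguage();
        boolean turkic = lang.equals("tr") || lang.equals("az");
        for (int i = 0; i < length; i++) {
            char c = buffer[i];
            if (turkic && c == 'I') {
                buffer[i] = '\u0131';   // LATIN SMALL LETTER DOTLESS I
            } else if (turkic && c == '\u0130') {
                buffer[i] = 'i';        // dotted capital I maps to plain i
            } else {
                buffer[i] = Character.toLowerCase(c); // locale-independent mapping
            }
        }
    }

    public static void main(String[] args) {
        char[] en = {'I'};
        char[] tr = {'I'};
        toLowerCase(en, en.length, Locale.ENGLISH);
        toLowerCase(tr, tr.length, Locale.forLanguageTag("tr-TR"));
        System.out.println(Integer.toHexString(en[0])); // prints 69  (i)
        System.out.println(Integer.toHexString(tr[0])); // prints 131 (dotless i)
    }
}
```

A filter built this way allocates no Strings per token, and the locale is 
chosen by the caller instead of being taken from the JVM default.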

> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-1581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1581
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Digy
>
> //Since I am a .NET programmer, the sample code is in C#, but I don't think 
> it will be hard to understand.
> //
> Assume an input text like "İ" and an analyzer like the one below:
> {code}
>       public class SomeAnalyzer : Analyzer
>       {
>               public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
>               {
>                       TokenStream t = new SomeTokenizer(reader);
>                       t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>                       t = new LowerCaseFilter(t);
>                       return t;
>               }
>         
>       }
> {code}
>       
> ASCIIFoldingFilter will return "I", and LowerCaseFilter will then return
>       "i" (if the locale is "en-US") 
>       or 
>       "ı" (if the locale is "tr-TR"), which means this token would have to be 
> passed to another instance of ASCIIFoldingFilter.
> So calling LowerCaseFilter before ASCIIFoldingFilter would be one solution, 
> but a better approach would be to add
> a new constructor to LowerCaseFilter that forces it to use a specific locale.
> {code}
>     public sealed class LowerCaseFilter : TokenFilter
>     {
>         /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>         // "in" is a reserved word in C#, so the parameter is named "input"
>         public LowerCaseFilter(TokenStream input) : base(input)
>         {
>         }
>         /* +++ */  public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
>         /* +++ */  {
>         /* +++ */      this.CultureInfo = CultureInfo;
>         /* +++ */  }
>               
>         public override Token Next(Token result)
>         {
>             result = Input.Next(result);
>             if (result != null)
>             {
>                 char[] buffer = result.TermBuffer();
>                 int length = result.termLength;
>                 for (int i = 0; i < length; i++)
>                     /* +++ */ buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
>                 return result;
>             }
>             else
>                 return null;
>         }
>     }
> {code}
> DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
