[
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693591#action_12693591
]
Robert Muir commented on LUCENE-1581:
-------------------------------------
Some comments I have on this topic. The problems I have with the default
internationalization support in Lucene revolve around the following:
1. Breaking text into words (tokenization) is not Unicode-sensitive,
i.e. a word containing s + combining macron (s̄) will not be tokenized
correctly.
2. Various filters, such as lowercasing (as mentioned here) and accent
removal, are not Unicode-sensitive, i.e. given s + combining macron (s̄),
accent removal will not strip the macron. This is not a normalization
problem, though it is true the filters also fail on decomposed NF(K)D
text for similar reasons. In this example, Unicode defines no composed
form for s + macron, so I cannot 'hack' around the problem by running
NFC on the text before feeding it to Lucene.
3. Unicode text must be normalized so that queries and indexed text share
a consistent representation.
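The missing-composed-form point can be checked directly. This is my illustration, not code from the issue, using the JDK's java.text.Normalizer as a stand-in for ICU:

```java
import java.text.Normalizer;

public class NoComposedForm {
    public static void main(String[] args) {
        String sMacron = "s\u0304"; // 's' + U+0304 COMBINING MACRON
        String nfc = Normalizer.normalize(sMacron, Normalizer.Form.NFC);
        // Unicode has no precomposed "s with macron", so NFC leaves the
        // sequence decomposed: still two code points.
        System.out.println(nfc.length());        // 2
        System.out.println(nfc.equals(sMacron)); // true
    }
}
```

So any filter that only looks for precomposed accented characters will pass s̄ through untouched, exactly as described above.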
One option I see is to provide at least a basic analyzer that uses ICU to:
1. Break text into words correctly.
2. Provide common filters that lowercase and remove accents correctly.
3. Normalize text to a single Unicode normal form (say, NFKC by default).
In my opinion, having this available would solve a majority of the current
problems. I have started on some of this in LUCENE-1488 (at least it does
step 1!).
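Filters 2 and 3 of the proposed analyzer can be sketched as follows; this is my own sketch using the JDK's java.text.Normalizer in place of ICU (the method names are mine, not from LUCENE-1488, and tokenization is omitted):

```java
import java.text.Normalizer;
import java.util.Locale;

public class UnicodeAwareFilters {
    // Step 3: put all text into one normal form (NFKC here).
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Step 2a: accent removal that handles any combining mark,
    // including s + macron, which has no composed form: decompose,
    // then strip nonspacing marks (\p{Mn}).
    static String removeMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{Mn}+", "");
    }

    // Step 2b: locale-sensitive lowercasing.
    static String lowercase(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        String word = "S\u0304"; // 'S' + combining macron
        String result = lowercase(removeMarks(normalize(word)), Locale.ROOT);
        System.out.println(result); // "s"
    }
}
```

An ICU-based version would use the same shape (e.g. an ICU transform such as "NFD; [:Nonspacing Mark:] Remove; NFC" for the accent step), with the advantage of also fixing word-breaking via ICU's BreakIterator.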
> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
> Key: LUCENE-1581
> URL: https://issues.apache.org/jira/browse/LUCENE-1581
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Digy
>
> // Since I am a .NET programmer, the sample code is in C#, but I don't
> // think it will be a problem to understand.
> Assume an input text like "İ" and an analyzer like the one below:
> {code}
> public class SomeAnalyzer : Analyzer
> {
>     public override TokenStream TokenStream(string fieldName,
>                                             System.IO.TextReader reader)
>     {
>         TokenStream t = new SomeTokenizer(reader);
>         t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>         t = new LowerCaseFilter(t);
>         return t;
>     }
> }
> {code}
>
> ASCIIFoldingFilter will return "I", and LowerCaseFilter will then return
> "i" (if the locale is "en-US")
> or
> "ı" (if the locale is "tr-TR"), which means that token should be fed to
> another instance of ASCIIFoldingFilter.
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be one
> solution, but a better approach is to add a new constructor to
> LowerCaseFilter that forces it to use a specific locale.
> {code}
> public sealed class LowerCaseFilter : TokenFilter
> {
>     /* +++ */ System.Globalization.CultureInfo CultureInfo =
>     /* +++ */     System.Globalization.CultureInfo.CurrentCulture;
>
>     // ("in" is a reserved word in C#, so the parameter is renamed)
>     public LowerCaseFilter(TokenStream input) : base(input)
>     {
>     }
>
>     /* +++ */ public LowerCaseFilter(TokenStream input,
>     /* +++ */     System.Globalization.CultureInfo cultureInfo) : base(input)
>     /* +++ */ {
>     /* +++ */     this.CultureInfo = cultureInfo;
>     /* +++ */ }
>
>     public override Token Next(Token result)
>     {
>         result = Input.Next(result);
>         if (result != null)
>         {
>             char[] buffer = result.TermBuffer();
>             int length = result.TermLength();
>             for (int i = 0; i < length; i++)
>                 /* +++ */ buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
>             return result;
>         }
>         else
>             return null;
>     }
> }
> {code}
> DIGY
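For reference, the locale behaviour the quoted issue describes is the same in Java (the project's own language); the locale choices below are my illustration of the issue's "en-US"/"tr-TR" example, not code from the issue:

```java
import java.util.Locale;

public class TurkishCasing {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");

        // Plain ASCII "I" lowercases differently per locale:
        System.out.println("I".toLowerCase(Locale.US)); // "i"
        System.out.println("I".toLowerCase(turkish));   // "ı" (dotless i, U+0131)

        // The input "İ" (U+0130) from the issue:
        System.out.println("\u0130".toLowerCase(turkish)); // "i"
        // In a non-Turkish locale it becomes "i" + U+0307 COMBINING DOT
        // ABOVE, i.e. two code points:
        System.out.println("\u0130".toLowerCase(Locale.US).length()); // 2
    }
}
```

This is why a locale-blind lowercase filter can silently produce tokens that no longer match what a Turkish-aware query analyzer would produce.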
--
This message is automatically generated by JIRA.