[ 
https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784472#action_12784472
 ] 

Uwe Schindler commented on LUCENE-2102:
---------------------------------------

Another possibility is to resolve the problem in a completely different way: 
you could wrap a MappingCharFilter around the input Reader in the Analyzer and 
just add a replacement for this one char:
[http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/analysis/MappingCharFilter.html]

This would be a very easy fix without code duplication: you just change the 
input before tokenization. And it's already in Lucene core; just plug it into 
the analyzer's tokenStream() or reusableTokenStream() method as a wrapper 
around the Reader param.

This would also be an easy fix for the other analyzers that have problems with 
rarely used chars. It can also be used to remove chars entirely or to replace 
them with longer sequences like ä -> ae (for German).
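
To illustrate the idea of changing the input before tokenization, here is a 
minimal sketch using only the JDK. It is not Lucene code and not the 
MappingCharFilter API: the names CharMappingSketch, MappingReader and 
mapAndLowerCase are hypothetical, and the sketch only handles one-to-one char 
replacements (no longer sequences such as ä -> ae, no offset correction). It 
assumes the Turkish case would be handled by mapping 'I' to dotless ı and 
dotted İ to 'i' before the usual lower-casing runs.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the "map chars before tokenizing" idea; not Lucene code.
class CharMappingSketch {

    /** Reader wrapper that replaces single chars according to a fixed table. */
    static final class MappingReader extends FilterReader {
        private final Map<Character, Character> mapping;

        MappingReader(Reader in, Map<Character, Character> mapping) {
            super(in);
            this.mapping = mapping;
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            if (c == -1) {
                return -1; // end of stream
            }
            Character mapped = mapping.get((char) c);
            return mapped != null ? mapped : c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            // n is -1 at end of stream, so the loop body is skipped in that case.
            for (int i = off; i < off + n; i++) {
                Character mapped = mapping.get(buf[i]);
                if (mapped != null) {
                    buf[i] = mapped;
                }
            }
            return n;
        }
    }

    /** Applies the assumed Turkish mapping, then the locale-insensitive lower-casing. */
    static String mapAndLowerCase(String text) {
        Map<Character, Character> turkish = new HashMap<>();
        turkish.put('I', '\u0131'); // LATIN SMALL LETTER DOTLESS I
        turkish.put('\u0130', 'i'); // LATIN CAPITAL LETTER I WITH DOT ABOVE

        StringBuilder out = new StringBuilder();
        try (Reader reader = new MappingReader(new StringReader(text), turkish)) {
            char[] buf = new char[64];
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    // Stand-in for what a lower-casing filter does per char.
                    out.append(Character.toLowerCase(buf[i]));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mapAndLowerCase("ISPARTA"));
    }
}
```

The lower-casing step itself stays untouched here; only the Reader it consumes 
is wrapped, which is the point of the suggestion above.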

> LowerCaseFilter for Turkish language
> ------------------------------------
>
>                 Key: LUCENE-2102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2102
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish 
> alphabet the lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
