[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language

Robert Muir (JIRA) Tue, 01 Dec 2009 14:57:45 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784499#action_12784499
 ]


Robert Muir commented on LUCENE-2102:
-------------------------------------

Uwe, it *is* specific to the Turkish case.
For German, whether you have A followed by a combining umlaut or A-umlaut as 
one precomposed character, it works regardless.
Turkish is the only case where it's more complex, because the casing of the 
character actually depends upon a diacritic that may not be composed, and may 
have other diacritics in between.
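
To make the difference concrete, here is a minimal sketch (not code from the patch; the class and method names are made up for illustration). It lowercases one char at a time with a fixed table, which is all a context-free mapping can do, and then compares the composed and decomposed inputs after NFC normalization.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Lucene.
class CanonicalCasingSketch {

    // Lowercase char by char with no look-ahead or look-behind.
    // With turkic=true, 'I' maps to dotless i (U+0131) and
    // capital I with dot above (U+0130) maps to 'i'.
    static String lowerPerChar(String s, boolean turkic) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (turkic && c == 'I') {
                out.append('\u0131');
            } else if (turkic && c == '\u0130') {
                out.append('i');
            } else {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // German: precomposed A-umlaut vs. A + combining diaeresis (U+0308).
        boolean german = nfc(lowerPerChar("\u00C4", false))
                .equals(nfc(lowerPerChar("A\u0308", false)));
        // Turkish: precomposed U+0130 vs. I + combining dot above (U+0307).
        boolean turkish = nfc(lowerPerChar("\u0130", true))
                .equals(nfc(lowerPerChar("I\u0307", true)));
        System.out.println("German forms agree:  " + german);   // true
        System.out.println("Turkish forms agree: " + turkish);  // false
    }
}
```

The decomposed Turkish input comes out as dotless i followed by a stray combining dot, because the table cannot see that the I was followed by U+0307.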

this is what makes it such a bear to support in case folding:

{noformat}
#      Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
#      See the discussions of case mapping in the Unicode Standard for more information.
{noformat}

The problem is that context is required, and sometimes marks must actually be 
deleted for proper casing.

{noformat}
# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

# When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I
{noformat}
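
For comparison, the JDK's locale-sensitive String.toLowerCase(Locale) applies these conditional rules for tr. A small sketch of what that looks like (the class and helper names are hypothetical; this only illustrates the rules quoted above, not what the filter in the patch does):

```java
import java.util.Locale;

// Hypothetical demo class, not part of Lucene.
class TurkishLocaleLowerSketch {

    static final Locale TR = Locale.forLanguageTag("tr");

    // Locale-sensitive lowercasing as provided by the JDK.
    static String lowerTr(String s) {
        return s.toLowerCase(TR);
    }

    // Render a string as U+XXXX code points so the marks are visible.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Bare I: Not_Before_Dot holds, so it becomes dotless i.
        System.out.println(codePoints(lowerTr("I")));        // U+0131
        // Precomposed I with dot above becomes plain i.
        System.out.println(codePoints(lowerTr("\u0130")));   // U+0069
        // I + combining dot above: the I becomes i and the mark is deleted.
        System.out.println(codePoints(lowerTr("I\u0307")));  // U+0069
        // Without the Turkish locale, a bare I becomes plain i.
        System.out.println(codePoints("I".toLowerCase(Locale.ROOT)));  // U+0069
    }
}
```

Note that the third case shortens the string by one char, which is the kind of context-dependent deletion described above.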

bq. but the last time I was there, they just used the simplest composed chars 
(like Germans).

This is why I recommended we not go crazy and only work on the composed form. 
But in the future we might want to correct this.
This is *impossible* to do with MappingCharFilter; that is my only point.

> LowerCaseFilter for Turkish language
> ------------------------------------
>
>                 Key: LUCENE-2102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2102
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Ahmet Arslan
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish 
> alphabet the lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I.
