[
https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784499#action_12784499
]
Robert Muir commented on LUCENE-2102:
-------------------------------------
Uwe, it *is* specific to the Turkish case.
For German, it works regardless of whether the input is A followed by a
combining umlaut (two characters) or the precomposed A-umlaut (one character).
Turkish is the only case where it's more complex, because the casing of the
character actually depends upon a diacritic that may not be composed, and may
have other diacritics in between.
This is what makes it such a bear to support in case folding:
{noformat}
# Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
# See the discussions of case mapping in the Unicode Standard for more information.
{noformat}
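To make the canonical-equivalence point concrete, here is a minimal sketch (not Lucene code; the class and method names are made up for illustration) that lowercases one code point at a time with no context and then compares composed and decomposed inputs after NFC normalization:

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of Lucene.
final class ComposedVsDecomposed {

    // Context-free lowercasing, one code point at a time.
    static String simpleLower(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Compose to NFC so canonically equivalent strings compare equal.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // True if lowercasing both inputs yields canonically equivalent results.
    static boolean sameAfterLowercasing(String a, String b) {
        return nfc(simpleLower(a)).equals(nfc(simpleLower(b)));
    }
}
```

With this sketch, `sameAfterLowercasing("\u00C4", "A\u0308")` is true (German A-umlaut, composed vs. decomposed), while `sameAfterLowercasing("\u0130", "I\u0307")` is false: the decomposed input keeps its U+0307 and no precomposed "i with dot above" exists for NFC to fold it into.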
The problem is that context is required, and sometimes marks must actually be
deleted for proper casing.
{noformat}
# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above
0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE
# When lowercasing, unless an I is before a dot_above, it turns into a dotless i.
0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I
{noformat}
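The two rules above can be sketched as a context-sensitive lowercasing routine. This is only an illustration under stated assumptions, not the attached patch: the class name is hypothetical, and "may have other marks in between" is approximated by allowing any non-spacing mark (`Character.NON_SPACING_MARK`) between I and U+0307, since the JDK does not expose canonical combining classes, which the Unicode definitions of After_I and Not_Before_Dot actually use.

```java
// Hypothetical sketch of Turkish/Azeri lowercasing; not the LUCENE-2102 patch.
final class TurkishLowerSketch {
    private static final int CAPITAL_I = 0x0049;
    private static final int CAPITAL_I_WITH_DOT = 0x0130;
    private static final int DOTLESS_I = 0x0131;
    private static final int COMBINING_DOT_ABOVE = 0x0307;

    static String lower(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == CAPITAL_I) {
                int dot = findDotAbove(s, i);
                if (dot < 0) {
                    // Not_Before_Dot: plain I lowercases to dotless i.
                    out.appendCodePoint(DOTLESS_I);
                } else {
                    // I + (marks) + dot_above: emit i, keep the other marks,
                    // and delete the dot_above itself (the After_I rule).
                    out.append('i');
                    out.append(s, i, dot);
                    i = dot + 1;
                }
            } else if (cp == CAPITAL_I_WITH_DOT) {
                // Precomposed dotted capital I lowercases to plain i.
                out.append('i');
            } else {
                out.appendCodePoint(Character.toLowerCase(cp));
            }
        }
        return out.toString();
    }

    // Index of a U+0307 that follows position `from` with only non-spacing
    // marks in between, or -1 if there is none. Simplification: the Unicode
    // rule is defined on combining classes, not on the Mn general category.
    private static int findDotAbove(String s, int from) {
        int j = from;
        while (j < s.length()) {
            int cp = s.codePointAt(j);
            if (cp == COMBINING_DOT_ABOVE) {
                return j;
            }
            if (Character.getType(cp) != Character.NON_SPACING_MARK) {
                return -1;
            }
            j += Character.charCount(cp);
        }
        return -1;
    }
}
```

Under these assumptions, `lower("I")` gives U+0131, while `lower("\u0130")` and `lower("I\u0307")` both give plain `i`. The lookahead and the deletion of a later character are the parts that a fixed, context-free character mapping cannot express.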
bq. but the last time I was there, they just used the simplest composed chars
(like germans).
This is why I recommended we not go crazy and only work on the composed form.
But in the future we might want to correct this.
This is *impossible* to do with MappingCharFilter; that is my only point.
> LowerCaseFilter for Turkish language
> ------------------------------------
>
> Key: LUCENE-2102
> URL: https://issues.apache.org/jira/browse/LUCENE-2102
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 3.0
> Reporter: Ahmet Arslan
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish
> alphabet the lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I.