Note that Unicode 3.2 (where these links point to) is over 10 years old. The current version is 6.2... and things haven’t got easier with newer versions.
A different route would be to convert the characters from our UTF-8 to the native platform encoding (using our existing APIs for that) and then make the platform do the case folding for us before APR does the comparison/search.

    Bert

Sent from Windows Mail

From: Ivan Zhakov
Sent: Sunday, April 21, 2013 8:44 PM
To: Branko Čibej
Cc: dev@subversion.apache.org

On Sun, Apr 21, 2013 at 10:07 PM, Branko Čibej <br...@wandisco.com> wrote:
> On 21.04.2013 17:11, Ivan Zhakov wrote:
>> On Sun, Apr 21, 2013 at 4:48 PM, Branko Čibej <br...@wandisco.com> wrote:
>>> On 21.04.2013 14:05, Stefan Sperling wrote:
>>>> On Sun, Apr 21, 2013 at 01:53:43PM +0200, Bert Huijben wrote:
>>>>> I'd rather pull the case insensitive search part of this new in 1.8
>>>>> search feature and do it right in 1.9.
>>>> What's the issue with the current implementation apart from the
>>>> test failures on Windows?
>>>>
>>>> The behaviour of 'svn log --search' regarding case-sensitivity
>>>> isn't even documented, so we're not really promising anything.
>>>>
>>>> It is possible that some users who are using languages other than
>>>> English will complain, since ASCII is being matched case-insensitively,
>>>> and all other characters are being matched case-sensitively.
>>>> But this is due to a missing feature in APR's implementation of fnmatch().
>>>>
>>>> Provided we can fix the 1.8.x tests on Windows I see no reason to
>>>> change our implementation of log --search. We can simply wait for
>>>> APR to grow the necessary support for multibyte strings.
>>> The wc-collate-path branch has an svn_utf__glob function that's mainly
>>> intended for use by SQLite, however, it can be a replacement for
>>> apr_fnmatch. It uses apr_fnmatch internally, but decomposes the inputs
>>> to Unicode normalization form D, which keeps diacriticals separate from
>>> the base letters.
>>> In other words, we could easily extend that to do
>>> completely diacritical-agnostic case-folding matching for Latin
>>> alphabets (and probably also for Cyrillic scripts).
>>>
>>> The idea to manually hack things to work with western Latin alphabets
>>> seems completely wrong-headed to me.
>>>
>>> But yes; in general, case folding is locale-specific. If we wanted to
>>> support that, we'd need ICU instead of utf8proc. I can imagine that
>>> eventually being an option, but not a mandatory dependency.
>>>
>> According to Unicode case folding data [1] only two characters
>> use locale-specific case folding.
>
> How on earth did you come to that conclusion?
>
> Yes, the obvious ones are German (ß == SS) equivalence and Turkic (i ==
> İ) and (ı == I) equivalences (and that's already three characters); but
> then in French, lowercase accented letters are equivalent to uppercase
> unaccented letters, whereas for example in Spanish that's not the case.
> And that's just looking at European and West Asian Latin scripts. There
> are at least 7 distinct Cyrillic scripts in roughly the same area that
> I'm aware of, and I certainly don't know the case-folding rules for all
> of them.
>
I've just read Unicode specs, but I didn't read all of them :)

According to the link I provided [1] there are 4 types of characters
in terms of case folding:
[[[
# The status field is:
# C: common case folding, common mappings shared by both simple and full mappings.
# F: full case folding, mappings that cause strings to grow in length. Multiple characters are separated by spaces.
# S: simple case folding, mappings to single characters where different from F.
# T: special case for uppercase I and dotted uppercase I
#    - For non-Turkic languages, this mapping is normally not used.
#    - For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters.
]]]

In CaseFolding-3.2.0.txt only the 'T' chars need locale-dependent handling.
BUT there is another document that describes special case-folding rules [2], which lists cases like ß == SS. I missed it. That's my fault.

[1] http://www.unicode.org/Public/3.2-Update/CaseFolding-3.2.0.txt
[2] http://www.unicode.org/Public/3.2-Update/SpecialCasing-3.2.0.txt

--
Ivan Zhakov
CTO | VisualSVN | http://www.visualsvn.com
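
To make the approach discussed in this thread concrete (decompose to normalization form D, case-fold, then glob), here is a minimal Python sketch. It is not the svn_utf__glob code from the wc-collate-path branch and does not use APR or utf8proc; the names fold() and glob_match() and the strip_marks option are made up for illustration, and Python's locale-independent str.casefold() stands in for the C and F mappings of CaseFolding.txt. The Turkic 'T' mappings are deliberately not applied.

```python
import fnmatch
import unicodedata


def fold(text: str, strip_marks: bool = False) -> str:
    """Return a locale-independent folded form of text (hypothetical helper)."""
    # Decompose to NFD so diacriticals become separate combining characters.
    decomposed = unicodedata.normalize("NFD", text)
    # Full case folding without locale tailoring: 'ß' becomes 'ss',
    # and the Turkic 'T' mappings are NOT applied.
    folded = unicodedata.normalize("NFD", decomposed.casefold())
    if strip_marks:
        # Optionally drop combining marks for diacritical-agnostic matching.
        folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    return folded


def glob_match(pattern: str, text: str, strip_marks: bool = False) -> bool:
    """Glob-match text against pattern after folding both (hypothetical helper)."""
    # fnmatchcase does no case handling of its own, so all case behaviour
    # comes from fold().
    return fnmatch.fnmatchcase(fold(text, strip_marks), fold(pattern, strip_marks))


if __name__ == "__main__":
    print(glob_match("*STRASSE*", "Fix Straße handling"))
    print(glob_match("*cibej*", "Branko Čibej"))
    print(glob_match("*cibej*", "Branko Čibej", strip_marks=True))
    print(fold("İ") == fold("i"))
```

With this sketch 'ß' and 'SS' compare equal, while dotted capital İ and plain i do not, which is exactly the locale-dependent case that the 'T' entries describe. Note that folding the pattern can change how bracket expressions containing accented letters behave, so a real implementation would need to treat those separately.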