utf-test.c

Evgeny Kotkov Sat, 20 Feb 2016 13:17:22 -0800

Branko Čibej <br...@apache.org> writes:

> Not really. For example, 'á' and 'A' are equivalent, but 'ß' and 'SS'
> are not — whereas the latter should be equivalent in German, but I doubt
> very much that utf8proc does that right. Case-insensitive comparison
> must *always* be done in the context of a well-defined locale. Anything
> that calls itself "locale-independent" is likely to be wrong in a really
> huge number of cases.


The Unicode Standard (Section 3.13 Default Case Algorithms) is quite clear
on how case-insensitive matching should be done [1]:

    Default caseless matching is the process of comparing two strings for
    case-insensitive equality. The definitions of Unicode Default Caseless
    Matching build on the definitions of Unicode Default Case Folding.

       Default Caseless Matching uses full case folding:

       A string X is a caseless match for a string Y if and only if:
       toCasefold(X) = toCasefold(Y)

       toCasefold(X): Map each character C in X to Case_Folding(C).

       Case_Folding(C) uses the mappings with the status field value “C” or
       “F” in the data file CaseFolding.txt in the Unicode Character Database.

    When comparing strings for case-insensitive equality, the strings should
    also be normalized for most correct results.

The behavior we get with this patch is well-defined and follows the spec,
since we normalize and fold the case of the strings with utf8proc.  (The
UTF8PROC_CASEFOLD flag results in full C + F case folding as per [2],
omitting special case T.)

>> But I'm wondering why you added this feature to an existing function?
>>
>> I don't think it is recommended practice to perform the normalization this
>> way and adding a boolean to an existing function makes it easier to do
>> perform things in a not recommended way.
>
> Adding flags that drastically change the semantics of a function is just
> broken API design, period.

I don't think that we expose this functionality in a broken way.  There aren't
that many options to choose from, since we need to perform the normalization
and the case folding in a single call to utf8proc, with appropriate flags set.
We could add an svn_utf__casefold() function that does both, but I'd rather
prefer what we have now.

After all, the maintainers of utf8proc expose its features in a quite similar
fashion [3] — with a normalize_string(..., casefold=true/false) function.

[1] http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf
[2] http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
[3] 
https://julia.readthedocs.org/en/latest/stdlib/strings/#Base.normalize_string


Regards,
Evgeny Kotkov

Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Reply via email to