utf-test.c

Branko Čibej Sun, 21 Feb 2016 05:23:08 -0800

On 20.02.2016 22:16, Evgeny Kotkov wrote:
> Branko Čibej <br...@apache.org> writes:
>
>> Not really. For example, 'á' and 'A' are equivalent, but 'ß' and 'SS'
>> are not — whereas the latter should be equivalent in German, but I doubt
>> very much that utf8proc does that right. Case-insensitive comparison
>> must *always* be done in the context of a well-defined locale. Anything
>> that calls itself "locale-independent" is likely to be wrong in a really
>> huge number of cases.
> The Unicode Standard (Section 3.13 Default Case Algorithms) is quite clear
> on how case-insensitive matching should be done [1]:
>
>     Default caseless matching is the process of comparing two strings for
>     case-insensitive equality. The definitions of Unicode Default Caseless
>     Matching build on the definitions of Unicode Default Case Folding.
>
>        Default Caseless Matching uses full case folding:
>
>        A string X is a caseless match for a string Y if and only if:
>        toCasefold(X) = toCasefold(Y)
>
>        toCasefold(X): Map each character C in X to Case_Folding(C).
>
>        Case_Folding(C) uses the mappings with the status field value “C” or
>        “F” in the data file CaseFolding.txt in the Unicode Character Database.
>
>     When comparing strings for case-insensitive equality, the strings should
>     also be normalized for most correct results.
>
> The behavior we get with this patch is well-defined and follows the spec,
> since we normalize and fold the case of the strings with utf8proc.  (The
> UTF8PROC_CASEFOLD flag results in full C + F case folding as per [2],
> omitting special case T.)


It may be well-defined and follow the spec, but the important question
IMO is whether the behaviour makes sense for our users. Bert mentioned
Turkish (i ∼ İ and I ∼ ı), I mentioned German, where ß ∼ SS, and French,
where ò ∼ O but that doesn't hold in Italian, where ò ∼ Ò ... these are
just off the top of my head, there must be tens or even hundreds of
examples where locale-independent case folding can give unexpected results.

Instead of relying on the Unicode spec, I propose a different approach:
to treat accented letters as if they don't have diacriticals at all.
This should be fairly easy to do with utf8proc: in the intermediate,
32-bit NFD string, remove any character that's in the
combining-diacritical group, and then convert the result to NFC UTF-8.
I've done this before with fairly good results; it's also much easier to
explain this behaviour to users than to tell them, "read the Unicode spec".

>>> But I'm wondering why you added this feature to an existing function?
>>>
>>> I don't think it is recommended practice to perform the normalization this
>>> way and adding a boolean to an existing function makes it easier to do
>>> perform things in a not recommended way.
>> Adding flags that drastically change the semantics of a function is just
>> broken API design, period.
> I don't think that we expose this functionality in a broken way.  There aren't
> that many options to choose from, since we need to perform the normalization
> and the case folding in a single call to utf8proc, with appropriate flags set.
> We could add an svn_utf__casefold() function that does both, but I'd rather
> prefer what we have now.
>
> After all, the maintainers of utf8proc expose its features in a quite similar
> fashion [3] — with a normalize_string(..., casefold=true/false) function.

Heh, quoting bad examples doesn't make the API change any better. :)

Personally I'd much prefer the svn_utf__casefold() you propose (i.e.,
normalize plus casefold) as a separate API. Internally, it can be
implemented with that extra flag, but even for a private API, I think
it's better to make each function do one thing.

-- Brane

Re: svn commit: r1731300 - in /subversion/trunk/subversion: include/private/svn_utf_private.h libsvn_repos/dump.c libsvn_subr/utf8proc.c svn/cl-log.h svn/log-cmd.c svn/svn.c tests/cmdline/log_tests.py tests/libsvn_subr/utf-test.c

Reply via email to