On 20.02.2016 22:16, Evgeny Kotkov wrote: > Branko Čibej <br...@apache.org> writes: > >> Not really. For example, 'á' and 'A' are equivalent, but 'ß' and 'SS' >> are not — whereas the latter should be equivalent in German, but I doubt >> very much that utf8proc does that right. Case-insensitive comparison >> must *always* be done in the context of a well-defined locale. Anything >> that calls itself "locale-independent" is likely to be wrong in a really >> huge number of cases. > The Unicode Standard (Section 3.13 Default Case Algorithms) is quite clear > on how case-insensitive matching should be done [1]: > > Default caseless matching is the process of comparing two strings for > case-insensitive equality. The definitions of Unicode Default Caseless > Matching build on the definitions of Unicode Default Case Folding. > > Default Caseless Matching uses full case folding: > > A string X is a caseless match for a string Y if and only if: > toCasefold(X) = toCasefold(Y) > > toCasefold(X): Map each character C in X to Case_Folding(C). > > Case_Folding(C) uses the mappings with the status field value “C” or > “F” in the data file CaseFolding.txt in the Unicode Character Database. > > When comparing strings for case-insensitive equality, the strings should > also be normalized for most correct results. > > The behavior we get with this patch is well-defined and follows the spec, > since we normalize and fold the case of the strings with utf8proc. (The > UTF8PROC_CASEFOLD flag results in full C + F case folding as per [2], > omitting special case T.)
It may be well-defined and follow the spec, but the important question IMO is whether the behaviour makes sense for our users. Bert mentioned Turkish (i ∼ İ and I ∼ ı), I mentioned German, where ß ∼ SS, and French, where ò ∼ O but that doesn't hold in Italian, where ò ∼ Ò ... these are just off the top of my head, there must be tens or even hundreds of examples where locale-independent case folding can give unexpected results. Instead of relying on the Unicode spec, I propose a different approach: to treat accented letters as if they don't have diacriticals at all. This should be fairly easy to do with utf8proc: in the intermediate, 32-bit NFD string, remove any character that's in the combining-diacritical group, and then convert the result to NFC UTF-8. I've done this before with fairly good results; it's also much easier to explain this behaviour to users than to tell them, "read the Unicode spec". >>> But I'm wondering why you added this feature to an existing function? >>> >>> I don't think it is recommended practice to perform the normalization this >>> way and adding a boolean to an existing function makes it easier to do >>> perform things in a not recommended way. >> Adding flags that drastically change the semantics of a function is just >> broken API design, period. > I don't think that we expose this functionality in a broken way. There aren't > that many options to choose from, since we need to perform the normalization > and the case folding in a single call to utf8proc, with appropriate flags set. > We could add an svn_utf__casefold() function that does both, but I'd rather > prefer what we have now. > > After all, the maintainers of utf8proc expose its features in a quite similar > fashion [3] — with a normalize_string(..., casefold=true/false) function. Heh, quoting bad examples doesn't make the API change any better. :) Personally I'd much prefer the svn_utf__casefold() you propose (i.e., normalize plus casefold) as a separate API. Internally, it can be implemented with that extra flag, but even for a private API, I think it's better to make each function do one thing. -- Brane