Branko Čibej <br...@apache.org> writes:

> Not really. For example, 'á' and 'A' are equivalent, but 'ß' and 'SS'
> are not, whereas the latter should be equivalent in German, but I doubt
> very much that utf8proc does that right. Case-insensitive comparison
> must *always* be done in the context of a well-defined locale. Anything
> that calls itself "locale-independent" is likely to be wrong in a really
> huge number of cases.

The Unicode Standard (Section 3.13 Default Case Algorithms) is quite clear
on how case-insensitive matching should be done [1]:

  Default caseless matching is the process of comparing two strings for
  case-insensitive equality. The definitions of Unicode Default Caseless
  Matching build on the definitions of Unicode Default Case Folding.

  Default Caseless Matching uses full case folding: A string X is a
  caseless match for a string Y if and only if:
  toCasefold(X) = toCasefold(Y)

  toCasefold(X): Map each character C in X to Case_Folding(C).

  Case_Folding(C) uses the mappings with the status field value "C" or
  "F" in the data file CaseFolding.txt in the Unicode Character Database.

  When comparing strings for case-insensitive equality, the strings
  should also be normalized for most correct results.

The behavior we get with this patch is well-defined and follows the spec,
since we normalize and fold the case of the strings with utf8proc. (The
UTF8PROC_CASEFOLD flag results in full C + F case folding as per [2],
omitting the special case T.)

>> But I'm wondering why you added this feature to an existing function?
>>
>> I don't think it is recommended practice to perform the normalization this
>> way and adding a boolean to an existing function makes it easier to
>> perform things in a not recommended way.
>
> Adding flags that drastically change the semantics of a function is just
> broken API design, period.

I don't think that we expose this functionality in a broken way. There
aren't that many options to choose from, since we need to perform the
normalization and the case folding in a single call to utf8proc, with the
appropriate flags set. We could add an svn_utf__casefold() function that
does both, but I'd prefer what we have now. After all, the maintainers of
utf8proc expose its features in a quite similar fashion [3], with a
normalize_string(..., casefold=true/false) function.
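
For illustration only, here is a minimal sketch of the matching rule quoted
from Section 3.13, written in Python rather than against utf8proc. It
assumes that Python's str.casefold() applies the full (C + F) case folding
and uses the standard unicodedata module for normalization. The names
caseless_key() and caseless_match() are hypothetical and are not part of
the patch, and the normalize/fold/normalize order shown here is the
canonical caseless matching from the standard, not necessarily the exact
sequence utf8proc performs internally.

```python
import unicodedata


def caseless_key(s: str) -> str:
    """Return a key for locale-independent caseless comparison.

    Hypothetical helper: decompose (NFD), apply full case folding,
    then decompose again, since folding can produce text that is no
    longer normalized.
    """
    decomposed = unicodedata.normalize("NFD", s)
    folded = decomposed.casefold()  # full folding, no Turkic "T" mappings
    return unicodedata.normalize("NFD", folded)


def caseless_match(x: str, y: str) -> bool:
    """X is a caseless match for Y iff their folded keys are equal."""
    return caseless_key(x) == caseless_key(y)
```

With this sketch, 'ß' matches 'SS' (both fold to "ss"), a precomposed 'Á'
matches 'a' followed by U+0301, 'á' does not match 'A', and 'I' does not
match the dotless 'ı', because the T mappings are not applied.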

[1] http://www.unicode.org/versions/Unicode8.0.0/ch03.pdf
[2] http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
[3] https://julia.readthedocs.org/en/latest/stdlib/strings/#Base.normalize_string


Regards,
Evgeny Kotkov