Branko Čibej <br...@apache.org> writes:

> Personally I'd much prefer the svn_utf__casefold() you propose (i.e.,
> normalize plus casefold) as a separate API.  Internally, it can be
> implemented with that extra flag, but even for a private API, I think
> it's better to make each function do one thing.
After giving it more thought, I agree that a separate API is a better
choice here.  For now, I added svn_utf__casefold() in r1732152.

> Instead of relying on the Unicode spec, I propose a different approach:
> to treat accented letters as if they don't have diacriticals at all.
> This should be fairly easy to do with utf8proc: in the intermediate,
> 32-bit NFD string, remove any character that's in the
> combining-diacritical group, and then convert the result to NFC UTF-8.
> I've done this before with fairly good results; it's also much easier to
> explain this behaviour to users than to tell them, "read the Unicode spec".

I see that utf8proc has a UTF8PROC_STRIPMARK flag that does something
similar to what you describe.  The difference is that this option strips
the codepoints that fall into any of the Mn (Nonspacing_Mark),
Mc (Spacing_Mark) or Me (Enclosing_Mark) categories [1].  Although that is
more than just removing the characters in the Combining Diacritical Marks
blocks [2,3,4,5], I am thinking that we could simply use this flag.  Would
that cover what you propose?  (There is a minimal standalone sketch of how
the flag behaves at the end of this message.)

Another question is about exposing this ability in the API.  I'd say that
we could do something like this:

  svn_utf__transform(svn_boolean_t normalize,
                     svn_boolean_t casefold,
                     svn_boolean_t remove_diacritics)

(or maybe svn_utf__map / svn_utf__alter / svn_utf__fold?)

Do you have an opinion or suggestions about that?

[1] http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
[2] http://www.unicode.org/charts/PDF/U0300.pdf
[3] http://www.unicode.org/charts/PDF/U1AB0.pdf
[4] http://www.unicode.org/charts/PDF/U1DC0.pdf
[5] http://www.unicode.org/charts/PDF/U20D0.pdf

Regards,
Evgeny Kotkov
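
P.S. For the record, here is a minimal standalone sketch of how
UTF8PROC_STRIPMARK behaves when combined with normalization and
casefolding.  This is not Subversion code, just an illustration built
directly on utf8proc's utf8proc_map(); the input string and the exact
flag combination are only examples:

  #include <stdio.h>
  #include <stdlib.h>
  #include <utf8proc.h>

  int main(void)
  {
    /* Illustrative input; with the flags below the expected output
       is "angstrom". */
    const char *input = "Ångström";
    utf8proc_uint8_t *output = NULL;
    utf8proc_ssize_t len;

    /* NFC-normalize, casefold and strip Mn/Mc/Me marks in one pass.
       UTF8PROC_STRIPMARK has to be combined with UTF8PROC_COMPOSE or
       UTF8PROC_DECOMPOSE. */
    len = utf8proc_map((const utf8proc_uint8_t *)input, 0, &output,
                       UTF8PROC_NULLTERM | UTF8PROC_STABLE
                         | UTF8PROC_COMPOSE | UTF8PROC_CASEFOLD
                         | UTF8PROC_STRIPMARK);
    if (len < 0)
      {
        fprintf(stderr, "utf8proc error: %s\n", utf8proc_errmsg(len));
        return 1;
      }

    printf("%s\n", (const char *)output);
    free(output);  /* utf8proc_map() allocates the result with malloc(). */
    return 0;
  }

In the private API, the remove_diacritics (and casefold) booleans could
presumably just map onto this set of utf8proc options internally.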