On Mon, 30 May 2016 17:14:47 +0000, Andrew Godfrey <[email protected]> wrote:

> I like "make string iteration explicit" but I wonder about other
> constructs. E.g. What about "sort an array of strings"? How would
> you tell a generic sort function whether you want it to interpret
> strings by code unit vs code point vs grapheme?

You are just scratching the surface! Unicode strings are sorted
following the Unicode Collation Algorithm, which is described in
the 86-page document here: (http://www.unicode.org/reports/tr10/)
and which is implemented in the ICU library mentioned before.

Some obvious considerations from the description of the algorithm:

In Sweden z comes before ö, while in Germany it's the reverse.
In Germany, words in a dictionary are sorted differently from
lists of names in a phone book.
dictionary: of < öf, phone book: öf < of
Traditional Spanish sorting treats 'll' as a single letter that
comes right after 'l'.

The default collation is selected in Windows through the control
panel's localization app and on Linux (POSIX) using the LC_COLLATE
environment variable. The actual string sorting in the user's
locale can then be performed with the C library using
http://www.cplusplus.com/reference/cstring/strcoll/
or OS-specific functions like CompareStringEx on Windows
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx

TL;DR: neither code points nor grapheme clusters are adequate for
string sorting.

Also, two strings may compare unequal byte for byte while they are
actually the same text in different normalization forms.
(E.g. umlauts on OS X (NFD) vs. the rest of the world (NFC).)
Admittedly I find myself using str1 == str2 without first
normalizing both, because it is frigging convenient and fast.

--
Marco
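
P.S. A minimal sketch of the two points above, written in Python only because its standard library has the pieces at hand. The names same_text and sort_in_locale are hypothetical helpers made up for this illustration, and the locale name in the usage note is an assumption about what is installed on a given system. The first part shows the NFC vs. NFD mismatch and a normalize-then-compare fix; the second shows that plain code point order puts "öf" after "zu", which is not what a German dictionary does. The locale-aware variant relies on the C library's collation (the strxfrm counterpart to strcoll), so its results depend on the platform.

```python
import locale
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both into one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


nfc = "\u00f6f"   # 'ö' as a single precomposed code point, then 'f'
nfd = "o\u0308f"  # 'o' followed by a combining diaeresis, then 'f'

print(nfc == nfd)           # False: different code point sequences
print(same_text(nfc, nfd))  # True: identical once both are normalized

# Sorting by code point: 'o' (U+006F) < 'z' (U+007A) < 'ö' (U+00F6).
words = ["of", "zu", "\u00f6f"]
print(sorted(words))        # ['of', 'zu', 'öf']


def sort_in_locale(items, locale_name):
    """Sort using the C library's collation for the named locale.

    Raises locale.Error if that locale is not available on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)
```

Something like sort_in_locale(words, "de_DE.UTF-8") would be the locale-aware call, assuming that locale exists on the machine. For anything beyond the platform's own collation tables, ICU remains the fuller implementation of the algorithm.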
