Hi,

Mike Gran <spk...@yahoo.com> writes:

>> From: Ludovic Courtès <l...@gnu.org>
>>
>> > I know of two categories of bugs.  One has to do with case conversions
>> > and case-insensitive comparisons, which must be done on entire strings
>> > but are currently done for each character.  Here are some examples:
>> >
>> >   (string-upcase "Straße") => "STRAßE"       (should be "STRASSE")
>> >   (string-downcase "ΧΑΟΣΣ") => "χαοσσ"       (should be "χαoσς")
>> >   (string-downcase "ΧΑΟΣ Σ") => "χαοσ σ"     (should be "χαoς σ")
>> >   (string-ci=? "Straße" "Strasse") => #f     (should be #t)
>> >   (string-ci=? "ΧΑΟΣ" "χαoσ") => #f          (should be #t)
>>
>> (Mike pointed out that SRFI-13 does not consider these bugs, but that’s
>> linguistically wrong so I’d consider it a bug.  Note that all these
>> functions are ‘linguistically buggy’ anyway since they don’t have a
>> locale argument, which breaks with Turkish ‘İ’.)
>>
>> Can we first check what would need to be done to fix this in 2.0.x?
>>
>> At first glance:
>>
>>   - “Straße” is normally stored as a Latin1 string, so it would need to
>>     be converted to UTF-* before it can be passed to one of the
>>     unicase.h functions.  *Or*, we could check with bug-libunistring
>>     what it would take to add Latin1 string case mapping functions.
>>
>>     Interestingly, ‘ß’ is the only Latin1 character that doesn’t have a
>>     one-to-one case mapping.  All other Latin1 strings can be handled by
>>     iterating over characters, as is currently done.
>
> There is the micro sign, which, when case folded, becomes a Greek mu.
> It is still a single character, but, it is the only latin-1 character that,
> when folded, becomes a non-Latin-1 character

Blech.  It would have worked better with narrow == ASCII instead of
narrow == Latin1.  It’s a change we can still make, I think.

>>   - Case insensitive comparison is more difficult, as you already
>>     pointed out.  To do it right we’d probably need to convert Latin1
>>     strings to UTF-32 and then pass it to u32_casecmp.  We don’t have to
>>     do the conversion every time, though: we could just change Latin1
>>     strings in-place so they now point to a wide stringbuf upon the
>>     first ‘string-ci=’.
>>
>> Thoughts?
>
> What about the srfi-13 case insensitive comparisons (the ones that don't
> terminate in question marks, like string-ci<)?  Should they remain
> as srfi-13 suggests, or should they remain similar in behavior
> to the question-mark-terminated comparisons?

Well, if maintaining two string comparison algorithms is reasonable,
then we can keep both; otherwise, I’d vote for the R6RS way.

> Mark is right that fixing this will not be pretty.  The case insensitive
> string comparisons, for example, could be patched like the attached
> snippet.  If you don't find it too ugly of an approach, I could work on
> a real patch.

Indeed it’s quite inelegant.  ;-)

How about changing to narrow == ASCII and then string comparison would
be:

  if (narrow (s1) != narrow (s2))
    {
      /* Handle ß -> ss: widen whichever string is still narrow.  */
      if (narrow (s1))
        widify (s1);
      else
        widify (s2);
    }

  if (narrow (s1))
    /* S1 and S2 are ASCII.  */
    return strcmp (char_data (s1), char_data (s2));
  else
    /* S1 and S2 are UTF-32.  */
    return u32_cmp (wide_char_data (s1), wide_char_data (s2));

Looks like that would remain reasonable while actually fixing our
problems.

As a side-effect, though, scm_from_latin1_locale would become slightly
less efficient because it’d need to check for non-ASCII chars.

Thanks,
Ludo’.
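
For reference, the narrow == ASCII comparison outlined above can be
fleshed out as a standalone C sketch.  Everything here is an
assumption made for illustration: `struct demo_string' and the
`demo_*' helpers are hypothetical and are not Guile's stringbuf API,
and a plain code-point loop stands in for libunistring's comparison
so that the example builds with the C standard library only.  Like
the outline, it compares case-sensitively; a case-insensitive variant
would presumably call u32_casecmp on the wide branch instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string type: narrow means pure ASCII bytes, wide means
   UTF-32 code points.  This is not Guile's real stringbuf layout.  */
struct demo_string
{
  int is_narrow;            /* non-zero while the contents are ASCII */
  size_t len;               /* length in characters */
  const char *narrow_data;  /* NUL-terminated; valid when is_narrow */
  uint32_t *wide_data;      /* valid when !is_narrow */
};

/* Wrap a NUL-terminated ASCII literal without copying it.  */
struct demo_string
demo_from_ascii (const char *text)
{
  struct demo_string s;
  s.is_narrow = 1;
  s.len = strlen (text);
  s.narrow_data = text;
  s.wide_data = NULL;
  return s;
}

/* Build a wide string from LEN code points (copied).  */
struct demo_string
demo_from_wide (const uint32_t *points, size_t len)
{
  struct demo_string s;
  s.is_narrow = 0;
  s.len = len;
  s.narrow_data = NULL;
  s.wide_data = malloc ((len ? len : 1) * sizeof *s.wide_data);
  assert (s.wide_data != NULL);
  memcpy (s.wide_data, points, len * sizeof *s.wide_data);
  return s;
}

/* Switch S in place from narrow to wide; a no-op if already wide.  */
void
demo_widify (struct demo_string *s)
{
  if (!s->is_narrow)
    return;
  uint32_t *wide = malloc ((s->len ? s->len : 1) * sizeof *wide);
  assert (wide != NULL);
  for (size_t i = 0; i < s->len; i++)
    wide[i] = (unsigned char) s->narrow_data[i];
  s->wide_data = wide;
  s->is_narrow = 0;
}

/* Code-point-wise comparison of two wide buffers: -1, 0 or 1.  */
int
demo_wide_cmp (const uint32_t *a, size_t na, const uint32_t *b, size_t nb)
{
  size_t n = na < nb ? na : nb;
  for (size_t i = 0; i < n; i++)
    if (a[i] != b[i])
      return a[i] < b[i] ? -1 : 1;
  return (na > nb) - (na < nb);
}

/* Compare S1 and S2, widening the narrow one if the widths differ.  */
int
demo_string_cmp (struct demo_string *s1, struct demo_string *s2)
{
  if (!s1->is_narrow != !s2->is_narrow)
    {
      if (s1->is_narrow)
        demo_widify (s1);
      else
        demo_widify (s2);
    }

  if (s1->is_narrow)
    /* Both are ASCII.  */
    return strcmp (s1->narrow_data, s2->narrow_data);
  else
    /* Both are UTF-32.  */
    return demo_wide_cmp (s1->wide_data, s1->len, s2->wide_data, s2->len);
}
```

Two ASCII strings never leave the strcmp path; a string is widened
only the first time it meets a wide one, and it stays wide afterwards,
so the conversion cost is paid once per string.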