Mike Gran <spk...@yahoo.com> writes:

>> The reason I am still arguing this point is because I have looked
>> seriously at what I would need to do to (A) fix our i18n problems and
>> (B) make the code efficient.  I very much want to fix these things,
>> but the pain of trying to do this with our current scheme is too much
>> for me to bear.  I shouldn't have to rewrite libunistring, and I
>> shouldn't have to write 3 or 4 different variants of each procedure
>> that takes two string parameters.
>
> What procedures are giving incorrect results?
I know of two categories of bugs.  One has to do with case conversions
and case-insensitive comparisons, which must be done on entire strings
but are currently done one character at a time.  Here are some
examples:

  (string-upcase "Straße")          => "STRAßE"   (should be "STRASSE")
  (string-downcase "ΧΑΟΣΣ")         => "χαοσσ"    (should be "χαοσς")
  (string-downcase "ΧΑΟΣ Σ")        => "χαοσ σ"   (should be "χαος σ")
  (string-ci=? "Straße" "Strasse")  => #f         (should be #t)
  (string-ci=? "ΧΑΟΣ" "χαος")       => #f         (should be #t)

The other big category of problems has to do with the fact that
scm_from_locale_{string,symbol,keyword} is currently used in many
places where the C string being converted is a compile-time constant.
This is a bug unless the strings are ASCII-only, because the locale is
normally that of the user, which is not necessarily that of the source
code.

Ludovic, Andy, and I discussed this on IRC and came to the conclusion
that UTF-8 should be the encoding assumed by functions such as
scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic,
scm_c_export, scm_c_define_module, scm_c_resolve_module,
scm_c_use_module, etc.  However, this creates pressure to make
scm_from_utf8_string and scm_from_utf8_symbol as efficient as possible.

With the current string representation scheme, the plan for
scm_from_utf8_string is to scan up to the first 100 characters of the
input string; if the string is found to be ASCII-only, we can use
scm_from_latin1_string, and otherwise we must fall back to
scm_from_stringn, which is noticeably slower.

An unfortunate complication is that the snarfing macros such as
SCM_DEFINE et al. arrange to store the symbol names as compile-time
constants and thus to put them in a read-only segment of the shared
library.  This is done with some preprocessor magic in snarf.h (see
SCM_IMMUTABLE_STRINGBUF).  I would like to make SCM_DEFINE et al. work
for arbitrary UTF-8 strings, but I can do that with cpp only if UTF-8
is the internal representation.  As things currently stand, those
macros must be limited to ASCII-only names, which is unfair to
non-English speakers.
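Going back to the first category: below is a minimal, self-contained C
sketch (not Guile code, just an illustration of the libunistring
primitives involved; compile with -lunistring) of what whole-string
case operations look like.  It assumes the source file is itself UTF-8
and passes NULL for the language argument, which selects the
language-independent mappings.

/* Standalone sketch: whole-string case mapping and case-insensitive
   comparison using libunistring's <unicase.h>. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unicase.h>    /* u8_toupper, u8_casecmp */

int
main (void)
{
  const uint8_t *strasse     = (const uint8_t *) "Straße";
  const uint8_t *chaos_upper = (const uint8_t *) "ΧΑΟΣ";
  const uint8_t *chaos_lower = (const uint8_t *) "χαος";

  /* Whole-string uppercasing: "ß" expands to "SS", so the result is
     longer than the input -- impossible with per-character mapping. */
  size_t len;
  uint8_t *up = u8_toupper (strasse, strlen ((const char *) strasse),
                            NULL /* language */, NULL /* no normalization */,
                            NULL, &len);
  printf ("%.*s\n", (int) len, (const char *) up);   /* prints STRASSE */
  free (up);

  /* Whole-string case-insensitive comparison: Σ, σ and ς all fold to
     the same thing, so these two spellings compare equal. */
  int cmp;
  if (u8_casecmp (chaos_upper, strlen ((const char *) chaos_upper),
                  chaos_lower, strlen ((const char *) chaos_lower),
                  NULL, NULL, &cmp) == 0)
    printf ("equal? %s\n", cmp == 0 ? "yes" : "no");  /* prints yes */

  return 0;
}

The point is that u8_toupper may return a result longer than its
input, and u8_casecmp folds Σ, σ and ς together -- neither of which a
character-at-a-time loop over the string can get right.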
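And for the scm_from_utf8_string plan, here is one reading of it as a
sketch against the public C API (scm_from_latin1_string and
scm_from_stringn are the real functions mentioned above; the helper
name and the exact cut-off policy are just mine for illustration):

#include <libguile.h>
#include <string.h>

/* Hypothetical fast-path constructor, not the actual Guile code. */
static SCM
my_from_utf8_string (const char *str)
{
  size_t i;

  /* Scan at most the first 100 bytes.  If we reach the terminating
     NUL without seeing a byte >= 0x80, the whole string is ASCII,
     and ASCII is encoded identically in Latin-1 and UTF-8, so the
     cheap Latin-1 constructor gives the right result. */
  for (i = 0; i < 100; i++)
    {
      unsigned char c = (unsigned char) str[i];
      if (c == '\0')
        return scm_from_latin1_string (str);
      if (c >= 0x80)
        break;            /* non-ASCII: give up on the fast path */
    }

  /* Long or non-ASCII input: use the general converter, which is
     correct but noticeably slower. */
  return scm_from_stringn (str, strlen (str), "UTF-8",
                           SCM_FAILED_CONVERSION_ERROR);
}

Under this reading, a short ASCII name like "list-ref" takes the cheap
path, while a non-ASCII name (or anything longer than 100 bytes) goes
through the general converter.

    Mark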