Mike Gran <spk...@yahoo.com> writes:

>> The reason I am still arguing this point is because I have looked
>> seriously at what I would need to do to (A) fix our i18n problems and
>> (B) make the code efficient.  I very much want to fix these things,
>> but the pain of trying to do this with our current scheme is too much
>> for me to bear.  I shouldn't have to rewrite libunistring, and I
>> shouldn't have to write 3 or 4 different variants of each procedure
>> that takes two string parameters.
>
> What procedures are giving incorrect results?
I know of two categories of bugs.  One has to do with case conversions
and case-insensitive comparisons, which must be done on entire strings
but are currently done one character at a time.  Here are some
examples:

  (string-upcase "Straße")          => "STRAßE"   (should be "STRASSE")
  (string-downcase "ΧΑΟΣΣ")         => "χαοσσ"    (should be "χαοσς")
  (string-downcase "ΧΑΟΣ Σ")        => "χαοσ σ"   (should be "χαος σ")
  (string-ci=? "Straße" "Strasse")  => #f         (should be #t)
  (string-ci=? "ΧΑΟΣ" "χαος")       => #f         (should be #t)

The other big category of problems has to do with the fact that
scm_from_locale_{string,symbol,keyword} is currently used in many
places where the C string being converted is a compile-time constant.
This is a bug unless the strings are ASCII-only, because the locale is
normally that of the user, which is not necessarily that of the source
code.

Ludovic, Andy, and I discussed this on IRC and came to the conclusion
that UTF-8 should be the encoding assumed by functions such as
scm_c_define, scm_c_define_gsubr, scm_c_define_gsubr_with_generic,
scm_c_export, scm_c_define_module, scm_c_resolve_module,
scm_c_use_module, etc.  However, this creates pressure to make
scm_from_utf8_string and scm_from_utf8_symbol as efficient as possible.

With the current string representation scheme, the plan for
scm_from_utf8_string is to scan up to the first 100 characters of the
input string; if the string is found to be ASCII-only, we can use
scm_from_latin1_string, and otherwise we must fall back to
scm_from_stringn, which is noticeably slower.

An unfortunate complication is that the snarfing macros such as
SCM_DEFINE et al. arrange to store the symbol names as compile-time
constants and thus to put them in a read-only segment of the shared
library.  This is done with some preprocessor magic in snarf.h (see
SCM_IMMUTABLE_STRINGBUF).  I would like to make SCM_DEFINE et al. work
for arbitrary UTF-8 strings, but I can do that with cpp only if UTF-8
is the internal representation.  As things currently stand, those
macros must be limited to ASCII-only names, which is unfair to
non-English speakers.
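Going back to the first category: below is a minimal, self-contained C
sketch (not Guile code, just an illustration of the libunistring
primitives involved; compile with -lunistring) of what whole-string
case operations look like.  It assumes the source file is itself UTF-8
and passes NULL for the language argument, which selects the
language-independent mappings.

/* Standalone sketch: whole-string case mapping and case-insensitive
   comparison using libunistring's <unicase.h>. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unicase.h>    /* u8_toupper, u8_casecmp */

int
main (void)
{
  const uint8_t *strasse     = (const uint8_t *) "Straße";
  const uint8_t *chaos_upper = (const uint8_t *) "ΧΑΟΣ";
  const uint8_t *chaos_lower = (const uint8_t *) "χαος";

  /* Whole-string uppercasing: "ß" expands to "SS", so the result is
     longer than the input -- impossible with per-character mapping. */
  size_t len;
  uint8_t *up = u8_toupper (strasse, strlen ((const char *) strasse),
                            NULL /* language */, NULL /* no normalization */,
                            NULL, &len);
  printf ("%.*s\n", (int) len, (const char *) up);   /* prints STRASSE */
  free (up);

  /* Whole-string case-insensitive comparison: Σ, σ and ς all fold to
     the same thing, so these two spellings compare equal. */
  int cmp;
  if (u8_casecmp (chaos_upper, strlen ((const char *) chaos_upper),
                  chaos_lower, strlen ((const char *) chaos_lower),
                  NULL, NULL, &cmp) == 0)
    printf ("equal? %s\n", cmp == 0 ? "yes" : "no");  /* prints yes */

  return 0;
}

The point is that u8_toupper may return a result longer than its
input, and u8_casecmp folds Σ, σ and ς together -- neither of which a
character-at-a-time loop over the string can get right.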
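And for the scm_from_utf8_string plan, here is one reading of it as a
sketch against the public C API (scm_from_latin1_string and
scm_from_stringn are the real functions mentioned above; the helper
name and the exact cut-off policy are just mine for illustration):

#include <libguile.h>
#include <string.h>

/* Hypothetical fast-path constructor, not the actual Guile code. */
static SCM
my_from_utf8_string (const char *str)
{
  size_t i;

  /* Scan at most the first 100 bytes.  If we reach the terminating
     NUL without seeing a byte >= 0x80, the whole string is ASCII,
     and ASCII is encoded identically in Latin-1 and UTF-8, so the
     cheap Latin-1 constructor gives the right result. */
  for (i = 0; i < 100; i++)
    {
      unsigned char c = (unsigned char) str[i];
      if (c == '\0')
        return scm_from_latin1_string (str);
      if (c >= 0x80)
        break;            /* non-ASCII: give up on the fast path */
    }

  /* Long or non-ASCII input: use the general converter, which is
     correct but noticeably slower. */
  return scm_from_stringn (str, strlen (str), "UTF-8",
                           SCM_FAILED_CONVERSION_ERROR);
}

Under this reading, a short ASCII name like "list-ref" takes the cheap
path, while a non-ASCII name (or anything longer than 100 bytes) goes
through the general converter.

    Mark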