Hello,

Mike Gran <spk...@yahoo.com> writes:

> There are 3 good, actively developed solutions of which I am aware.
>
> 1. Use GNU libc functionality.  Encode wide strings as wchar_t.

That'd be POSIX functionality, actually.

> 2. Use GLib functionality.  Encode wide strings as UTF-8.  Possibly
> give up on O(1).  Possibly add indexing information to string to allow
> O(1), which might negate the space advantage of UTF-8.

Technically, depending on GLib would seem unreasonable to me.  :-)

BTW, Gnulib has a wealth of modules that could be helpful here:

  http://www.gnu.org/software/gnulib/MODULES.html#posix_ext_unicode

I used a few of them in Guile-R6RS-Libs to implement `string->utf8' and
such like.

> 3. Use IBM's ICU4c.  Encode wide strings as UTF-16.  Thus, add an
> obscure dependency.
>
> Option 3 is likely a non-starter, because it seems that Guile has
> tried to avoid adding new non-GNU dependencies.  It is technologically
> a great solution, IMHO.

At first sight, I'd rather avoid it as a dependency, if that's possible,
but that's mostly subjective.

> Let's say that a string is a union of either an ASCII char vector or a
> wchar_t vector.  A "character" then is just a Unicode codepoint.
> String-ref returns a wchar_t.  This is all in line with R6RS as I
> understand it.

Yes, that seems easily doable.

> There could then be a separate iterator and function set that does
> (likely O(n)) operations on the grapheme clusters of strings.  A
> grapheme cluster is a single written symbol which may be made up of
> several codepoints.  Unicode Standard Annex #29 describes how to
> partition a string into a set of graphemes.[1]

Hmm, that seems like a difficult topic.  It's not even mentioned in
SRFI-13.  I suppose it can be addressed at a later stage, possibly by
providing a specific API.

> There is the problem of systems where wchar_t is 2 bytes instead of 4
> bytes, like Cygwin.  For those systems, I'd recommend
> restricting functionality to 16-bit characters instead of trying to
> add an extra UTF-16 encoding/decoding step.  I think there should
> always be a complete codepoint in each wchar_t.

Agreed.  The GNU libc doc concurs (info "(libc) Extended Char Intro").

However, given this limitation, and other potential portability issues,
it's still unclear to me whether this would be a good choice.  We need
to look more closely at what Gnulib has to offer, IMO.

Thanks,
Ludo'.
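
PS: To make the "union of either an ASCII char vector or a wchar_t
vector" idea a bit more concrete, here is a minimal C sketch.  All the
names in it (`sketch_str', `sketch_str_init', `sketch_str_ref',
`sketch_str_free') are hypothetical and are not Guile's actual
internals.  It assumes, as discussed above, that each wchar_t holds one
complete codepoint, so on systems with a 2-byte wchar_t it would only
cover 16-bit characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical illustration only; not Guile's string representation.  */
enum sketch_str_kind { SKETCH_NARROW, SKETCH_WIDE };

struct sketch_str
{
  enum sketch_str_kind kind;
  size_t len;                   /* length in codepoints */
  union
  {
    unsigned char *narrow;      /* one ASCII character per element */
    wchar_t *wide;              /* one codepoint per element */
  } u;
};

/* Copy LEN codepoints from SRC, using narrow storage when every
   codepoint is ASCII.  Return 0 on success, -1 if allocation fails.  */
static int
sketch_str_init (struct sketch_str *s, const wchar_t *src, size_t len)
{
  size_t i;
  int ascii = 1;

  for (i = 0; i < len; i++)
    if ((unsigned long) src[i] > 0x7f)
      {
        ascii = 0;
        break;
      }

  s->len = len;
  if (ascii)
    {
      s->kind = SKETCH_NARROW;
      s->u.narrow = malloc (len ? len : 1);
      if (s->u.narrow == NULL)
        return -1;
      for (i = 0; i < len; i++)
        s->u.narrow[i] = (unsigned char) src[i];
    }
  else
    {
      s->kind = SKETCH_WIDE;
      s->u.wide = malloc ((len ? len : 1) * sizeof (wchar_t));
      if (s->u.wide == NULL)
        return -1;
      memcpy (s->u.wide, src, len * sizeof (wchar_t));
    }
  return 0;
}

/* O(1) analogue of `string-ref': always hand back a wchar_t,
   whichever storage is in use.  */
static wchar_t
sketch_str_ref (const struct sketch_str *s, size_t i)
{
  assert (i < s->len);
  return s->kind == SKETCH_NARROW ? (wchar_t) s->u.narrow[i] : s->u.wide[i];
}

static void
sketch_str_free (struct sketch_str *s)
{
  if (s->kind == SKETCH_NARROW)
    free (s->u.narrow);
  else
    free (s->u.wide);
}
```

With this layout a pure-ASCII string costs one byte per character, and
a string containing, say, U+03BB is stored wide, while callers of
`sketch_str_ref' see a wchar_t in both cases.  Grapheme clusters are
deliberately left out; they would need the separate, likely O(n),
iterator mentioned above.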