On Tue, Mar 18, 2008 at 4:50 PM, John Cowan <[EMAIL PROTECTED]> wrote:
> I'm not arguing that point. I'm arguing that there should be two
> different kinds of strings, one of which is UTF-8 and one of which
> contains single-byte characters in an unspecified ASCII-compatible
> encoding, *and that the Scheme core ought to know the difference*.
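To make the quoted point concrete, here is a quick illustration (in Python, not Chicken — just the handiest way to show it): the same byte string reads as different text under different ASCII-compatible encodings, and may not be valid UTF-8 at all, so a core that can't tell the two kinds of strings apart can't safely treat the byte kind as text.

```python
# Illustrative only: one byte string, three readings.
raw = b"caf\xe9"  # single-byte data in some unspecified ASCII-compatible encoding

print(raw.decode("latin-1"))    # e-acute: 0xE9 is U+00E9 in Latin-1
print(raw.decode("iso8859-7"))  # Greek iota: 0xE9 is U+03B9 in ISO 8859-7

try:
    raw.decode("utf-8")         # 0xE9 would begin a multi-byte sequence
except UnicodeDecodeError:
    print("not valid UTF-8")
```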
Maybe so. But you would want the usual string operations to work with
either kind of string, right? (Alex wrote about Gauche; it sounds like
they did a good job with that.)

> IMHO, UTF-8 BOMs are a baaaad idea, but that's another debate.

Right, we talked a bit about that last September... I'm still not sure
I see why they are bad, though maybe they could be considered
unnecessary. That would follow from the general principle of
separating metadata from data: put the encoding in the extended
attributes of the file, or in the resource fork if you've got one.
Maybe when Windows is dead, memory buses have attribute bits for every
word, thumb drives ship preformatted with ReiserFS v5 (optimized for
holographic storage), and tar can archive extended attributes
alongside the files, the BOM can be retired completely. :-)

> That turns out not to be the case. Start Notepad and paste in some
> random non-ASCII stuff from the Web, do a Save, and see what you get
> by default (or in earlier versions of Windows, whether you like it or
> not). You get little-endian UTF-16 with an explicit BOM.

Ookie.

> FWIU, the main reason is so that Strings can be safely passed between
> threads.

Yeah, you're probably right.

> Chicken characters are 24 bits, which is enough to handle the Unicode
> range of U+0000 to U+10FFFF (fits in, but does not fill up, 21 bits).

Cool.

> Not really, since in UTF-16 some characters are two code units long.
> Java made that mistake, now partly rectified.

I thought it was still a reasonable assumption most of the time,
except for the few extra scripts that required extending Unicode
beyond 16 bits? There could be a bit somewhere to indicate whether the
string contains any of those characters... but then you'd have to scan
the string to find out whether it does, in order to set the bit.

Or have four types of strings: byte (restricted) strings, UTF-8
strings, and fixed-char-size 16- and 24-bit strings. The latter two
could live in a unicode egg.
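On the two-code-units point, here is a quick sanity check (Python again, purely illustrative): counting 16-bit code units by encoding big-endian, so no BOM is prepended.

```python
# Count UTF-16 code units per string; big-endian encoding avoids the BOM.
def utf16_units(s):
    return len(s.encode("utf-16-be")) // 2

bmp = "\u00e9"         # U+00E9, inside the 16-bit BMP: one code unit
astral = "\U0001d11e"  # U+1D11E MUSICAL SYMBOL G CLEF: a surrogate pair

print(utf16_units(bmp))     # 1
print(utf16_units(astral))  # 2

# Either way the character fits Chicken's 24-bit representation,
# since all of Unicode fits in 21 bits:
print(ord(astral) <= 0x10FFFF)  # True
```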
The fixed-16-bit type would be useful often enough, and would save
memory a lot of the time. It could be converted to the fixed-24-bit
type automatically, and only when necessary: when storing a larger
character into the string, when reading from a UTF-8 string that
contains such characters, etc. From a user's perspective both
fixed-char-size types are the same thing: a string with O(1) access by
index. Converting to or from UTF-8, however, would have to be
explicit, because UTF-8 is the "native" Chicken type, and only in some
cases do you need the O(1) string-ref etc. Nevertheless, the usual
string operations could still do the right thing with all four types,
right?

(Oh well, it all sounds like too much work to implement, doesn't it?)

_______________________________________________
Chicken-users mailing list
Chicken-users@nongnu.org
http://lists.nongnu.org/mailman/listinfo/chicken-users