Re: [Chicken-users] ditching syntax-case modules for the utf8 egg

John Cowan Tue, 18 Mar 2008 20:57:08 -0700

Shawn Rutledge scripsit:

> But you would want the usual string operations to work with either
> kind of string, right?


Indeed.

> It could follow from the general principle of separating metadata from
> data: Put the encoding in the extended attributes of the file, or
> resource fork if you've got one.  

Specifically, the 8-BOM interferes with the ability of ASCII-aware but
8-bit clean programs to treat UTF-8 the same as ASCII.  When they expect
to see something specific (like #!) at the beginning, they see the 8-BOM
instead and barf.
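
To make the failure concrete, here is a minimal sketch in Python (not code from this thread). The helper name starts_with_shebang and the sample script are made up for illustration; the function stands in for any ASCII-aware, 8-bit-clean tool that checks the first two bytes of a file.

```python
import codecs


def starts_with_shebang(data: bytes) -> bool:
    """Hypothetical stand-in for an ASCII-aware, 8-bit-clean tool:
    it only compares the first two bytes against "#!"."""
    return data[:2] == b"#!"


# Illustrative script text; pure ASCII, so its UTF-8 bytes equal its ASCII bytes.
script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")       # UTF-8 without a BOM
with_bom = codecs.BOM_UTF8 + plain   # same bytes prefixed with EF BB BF

print(starts_with_shebang(plain))     # the check succeeds
print(starts_with_shebang(with_bom))  # the check fails: the file starts with the BOM
```

Without the BOM, the UTF-8 file is byte-for-byte what the tool expects from ASCII. With it, the first three bytes are EF BB BF and the #! check never matches.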

I'm all in favor of the 16-BOM, where there are no such issues, and
it also serves to reliably flag UTF-16/UCS-2 and to allow for variable
endianism.  Same with the 32-BOM, if anyone bothers to use UTF-32 for
interchange.
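
As a rough sketch of how a reader can use the 16-BOM and 32-BOM to pick encoding and byte order, assuming Python and its standard codecs constants (the names sniff_bom and decode_with_bom are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption of this example):

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (codec_name, bom_length), or (None, 0) if no 16/32-bit BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode using the detected BOM; the fallback codec is an assumption."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)


sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(sniff_bom(sample))
print(decode_with_bom(sample))
```

One caveat of this sketch: UTF-16 LE text whose first character after the BOM is U+0000 would be taken for UTF-32 LE, which is rare in practice but worth knowing.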

> I thought it was still a reasonable assumption most of the time,

Except when it isn't.  ASCII is a reasonable assumption most of the time,
except when it isn't.

> Or have 4 types of strings: byte (restricted strings), UTF-8, and
> fixed-char-size 16- and 24-bit strings.  

Check out http://larceny.ccs.neu.edu/larceny-trac/wiki/StringRepresentations ,
then let's talk, if there's anything left to talk about.  :-)

-- 
We are lost, lost.  No name, no business, no Precious, nothing.  Only empty.
Only hungry: yes, we are hungry.  A few little fishes, nassty bony little
fishes, for a poor creature, and they say death.  So wise they are; so just,
so very just.  --Gollum        [EMAIL PROTECTED]  http://ccil.org/~cowan


_______________________________________________
Chicken-users mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/chicken-users
