Re: UTF-16 and (ice-9 rdelim)

Neil Jerram Mon, 18 Jan 2010 12:15:09 -0800

Hi Mike,

Many thanks for your quick response.  I hope to work on these fixes
shortly.

A few comments...

Mike Gran <[email protected]> writes:

> This should work.  BOMs are either 2, 3, or 4 bytes for UTF-16, UTF-8,
> and UTF-32 respectively.  And if the port encoding is expected to be
> set correctly in the first place, a BOM should always be the first
> code point returned by read-char.

Thanks.  For the moment, I am assuming that the encoding will have
previously been declared correctly, by `set-port-encoding' or by a
`coding:' comment.
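
As a rough illustration of the BOM sizes quoted above, here is a small
sketch in Python (not Guile; it uses only Python's standard codecs
module, and the helper names bom_lengths and first_code_point are made
up for this example).  It also shows that when the encoding is given
explicitly, the decoder hands back the BOM as the first code point,
U+FEFF, which is the behaviour assumed above for read-char.

```python
import codecs

# BOM byte sequences as defined in Python's standard codecs module.
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def bom_lengths():
    """Return the BOM length in bytes for each encoding in BOMS."""
    return {name: len(bom) for name, bom in BOMS.items()}


def first_code_point(encoding, payload="x"):
    """Decode BOM + payload with an explicit encoding and return the
    numeric value of the first code point of the result."""
    data = BOMS[encoding] + payload.encode(encoding)
    return ord(data.decode(encoding)[0])


if __name__ == "__main__":
    for name, length in bom_lengths().items():
        print(name, length, hex(first_code_point(name)))
```

Note that this only says something about Python's decoders; whether a
given iconv-based port behaves the same way would need checking.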

> If you already have to go to the trouble of converting to u32, it might
> be simplest to reimplement the non-Latin-1 case in Scheme,
> since read-char and unread-char should work even for UTF-16.
> That might do bad things to speed, though.

I'll have a look; it's nice to prototype that way, at least.

> There are a couple of issues here.  If you want a port to automatically
> identify a Unicode encoding by checking its first four bytes for a BOM, 
> then you would need some sort of association table.  It wouldn't be that
> hard to do.

I'm not thinking of that yet.  (For the future, clearly it must be
possible, as Emacs is doing it all the time.)
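
For reference, the association table Mike mentions could look roughly
like the following.  This is only a sketch in Python, not a proposal
for the Guile implementation; BOM_TABLE and detect_bom are hypothetical
names.  The one subtlety is ordering: the UTF-32-LE BOM (FF FE 00 00)
starts with the UTF-16-LE BOM (FF FE), so the 4-byte entries have to be
tried first.

```python
import codecs

# Longest BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM,
# so checking UTF-16 first would misidentify UTF-32-LE input.
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(head):
    """Given the first (up to four) bytes of a stream, return a pair
    (encoding, bom_length), or (None, 0) if no BOM is recognised."""
    for bom, encoding in BOM_TABLE:
        if head.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

One known ambiguity remains: UTF-16-LE text consisting of a BOM followed
by U+0000 has the same first four bytes as a UTF-32-LE BOM, and this
sketch resolves that case in favour of UTF-32-LE.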

> But, if you just want to get rid of a BOM, you can cut it down to 
> a rule.  If the first code point that a port reads is U+FEFF and if the
> encoding has the string "utf" in it, ignore it.  If the first code point
> is U+FFFE and the encoding has "utf" in it, flag an error.

Agreed.
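
The rule is small enough to write down directly.  The following is a
sketch in Python operating on an already-decoded string rather than on
a port, so it only illustrates the logic; strip_leading_bom is a
made-up name, and the case-insensitive match on "utf" is an assumption
about how encoding names are spelled.

```python
def strip_leading_bom(text, encoding):
    """Apply the BOM rule to decoded text.

    If the encoding name contains "utf": drop a leading U+FEFF, and
    raise an error on a leading U+FFFE (a byte-swapped BOM, which
    suggests the declared endianness is wrong).  For any other
    encoding the text is returned unchanged.
    """
    if not text or "utf" not in encoding.lower():
        return text
    first = text[0]
    if first == "\ufeff":
        return text[1:]
    if first == "\ufffe":
        raise ValueError(
            "byte-swapped BOM (U+FFFE) at start of %s input" % encoding
        )
    return text
```

For example, strip_leading_bom("\ufeffabc", "UTF-16LE") gives "abc",
while the same string declared as "iso-8859-1" comes back untouched.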

Out of interest, does that mean that iconv will auto-detect the
endianness if the encoding does not explicitly say "le" or "be"?

Regards,
        Neil
