Hi Mike,

Many thanks for your quick response.  I'll hopefully work on these fixes
shortly.  A few comments...

Mike Gran <[email protected]> writes:

> This should work.  BOMs are either 2, 3, or 4 bytes for UTF-16, UTF-8,
> and UTF-32 respectively.  And if the port encoding is expected to be
> set correctly in the first place, a BOM should always be the first
> code point returned by read-char.

Thanks.  For the moment, I am assuming that the encoding will have
previously been declared correctly, by `set-port-encoding' or by a
`coding:' comment.

> If you already have to go to the trouble of converting to u32, it might
> be simplest to reimplement the non-Latin-1 case in Scheme,
> since read-char and unread-char should work even for UTF-16.
> That might do bad things to speed, though.

I'll have a look; it's nice to prototype that way, at least.

> There are a couple of issues here.  If you want a port to automatically
> identify a Unicode encoding by checking its first four bytes for a BOM,
> then you would need some sort of association table.  It wouldn't be that
> hard to do.

I'm not thinking of that yet.  (For the future, clearly it must be
possible, as Emacs is doing it all the time.)

> But, if you just want to get rid of a BOM, you can cut it down to
> a rule.  If the first code point that a port reads is U+FEFF and if the
> encoding has the string "utf" in it, ignore it.  If the first code point
> is U+FFFE and the encoding has "utf" in it, flag an error.

Agreed.  Out of interest, does that mean that iconv will auto-detect the
endianness if the encoding does not explicitly say "le" or "be"?

Regards,
        Neil
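
P.S. For concreteness, the BOM-skipping rule quoted above could be sketched roughly as follows. This is only an illustration in Python, not Guile code: the name `strip_bom` and its arguments are made up for the example, and it assumes the text has already been decoded with the declared encoding.

```python
BOM = "\ufeff"          # U+FEFF: a byte order mark decoded with the right byte order
SWAPPED_BOM = "\ufffe"  # U+FFFE: what a BOM looks like when the byte order is wrong


def strip_bom(text, encoding):
    """Apply the quoted rule to already-decoded text (hypothetical helper).

    If the encoding name contains "utf" and the first code point is U+FEFF,
    drop it; if the first code point is U+FFFE, flag an error.  Anything
    else, including non-"utf" encodings, is returned untouched.
    """
    if not text or "utf" not in encoding.lower():
        return text
    first = text[0]
    if first == BOM:
        return text[1:]
    if first == SWAPPED_BOM:
        raise ValueError("byte-swapped BOM (U+FFFE) at start of " + encoding + " input")
    return text
```

Only the first code point is inspected, so a U+FEFF appearing later in the stream is left in place.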
