Re: UTF-16 and (ice-9 rdelim)

Mike Gran Mon, 18 Jan 2010 13:40:42 -0800

> From: Neil Jerram
> Hi Mike,

> > But, if you just want to get rid of a BOM, you can cut it down to 
> > a rule.  If the first code point that a port reads is U+FEFF and if the
> > encoding has the string "utf" in it, ignore it.  If the first code point
> > is U+FFFE and the encoding has "utf" in it, flag an error.
> 
> Agreed.
> 
> Out of interest, does that mean that iconv will auto-detect the
> endianness if the encoding does not explicitly say "le" or "be"?


The Unicode FAQ from unicode.org says that "the unmarked form (UTF-16, UTF-32)
uses big-endian byte serialization by default, but may include a byte order
mark at the beginning to indicate the actual byte serialization used."  So,
I guess the strictly correct thing to do for UTF-16 would be to

* check for a BOM
* if it exists
  * if it is U+FFFE (a byte-swapped BOM), modify the port encoding to UTF-16-LE
  * if it is U+FEFF, leave the port encoding as UTF-16
  * discard the BOM
* else, leave the port encoding as UTF-16

and similarly for UTF-32.
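
To make the steps above concrete, here is a rough sketch in Python rather than
in Guile's port code. The function name resolve_utf16 is made up for
illustration, and it works on a byte string instead of a port, so treat it as
an outline of the rule and not as how Guile or iconv actually does it. The
unmarked case is mapped to an explicit big-endian codec so the result does not
depend on platform defaults.

```python
import codecs


def resolve_utf16(data: bytes):
    """Return (encoding, payload) for a byte stream labelled plain "UTF-16".

    Hypothetical helper: applies the BOM rule to the first two bytes.
    """
    head = data[:2]
    if head == codecs.BOM_UTF16_LE:
        # Bytes FF FE: reads as U+FFFE in big-endian order, so the
        # stream is really little-endian. Switch encoding, drop the BOM.
        return "utf-16-le", data[2:]
    if head == codecs.BOM_UTF16_BE:
        # Bytes FE FF: reads as U+FEFF, so big-endian is right. Drop the BOM.
        return "utf-16-be", data[2:]
    # No BOM: the unmarked form defaults to big-endian.
    return "utf-16-be", data


if __name__ == "__main__":
    encoding, payload = resolve_utf16(b"\xff\xfeh\x00i\x00")
    print(encoding, payload.decode(encoding))
```

The UTF-32 case would follow the same shape with four-byte BOMs
(codecs.BOM_UTF32_LE and codecs.BOM_UTF32_BE).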

Thanks,
- Mike
