> From: Neil Jerram
> Hi Mike,
> > But, if you just want to get rid of a BOM, you can cut it down to
> > a rule. If the first code point that a port reads is U+FEFF and if the
> > encoding has the string "utf" in it, ignore it. If the first code point
> > is U+FFFE and the encoding has "utf" in it, flag an error.
>
> Agreed.
>
> Out of interest, does that mean that iconv will auto-detect the
> endianness if the encoding does not explicitly say "le" or "be"?

The Unicode FAQ from unicode.org says that "the unmarked form (UTF-16,
UTF-32) uses big-endian byte serialization by default, but may include
a byte order mark at the beginning to indicate the actual byte
serialization used."

So, I guess the strictly correct thing to do for UTF-16 would be to

* check for a BOM.
* if it exists
  * if it is U+FFFE, modify the port encoding to UTF-16-LE
  * if it is U+FEFF, leave the port encoding as UTF-16
  * discard the BOM
* else, leave the port encoding as UTF-16

and similarly for UTF-32.

Thanks,

- Mike
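
For illustration, the UTF-16 steps above could be sketched roughly as follows. This is Python rather than actual port code, and it is only a sketch under assumptions: the function name resolve_utf16_bom is made up for this example, and it returns explicit "utf-16-be" / "utf-16-le" codec names because Python's plain "utf-16" codec does its own BOM handling. The first two bytes are read as big-endian, the default for the unmarked form, so a little-endian stream shows up as U+FFFE.

```python
from typing import Tuple


def resolve_utf16_bom(first_bytes: bytes) -> Tuple[str, int]:
    """Return (codec name to use, number of BOM bytes to discard).

    Hypothetical helper: it only inspects the first two bytes of a
    stream whose declared encoding is the unmarked form "UTF-16".
    """
    if len(first_bytes) >= 2:
        # Read the first code unit as big-endian, the unmarked default.
        code_unit = int.from_bytes(first_bytes[:2], "big")
        if code_unit == 0xFFFE:
            # Byte-swapped BOM: the stream is really little-endian.
            return "utf-16-le", 2
        if code_unit == 0xFEFF:
            # BOM confirms big-endian; just discard it.
            return "utf-16-be", 2
    # No BOM (or too little data): keep the big-endian default.
    return "utf-16-be", 0


# Usage: bytes FF FE followed by "hi" in little-endian order.
data = b"\xff\xfeh\x00i\x00"
encoding, skip = resolve_utf16_bom(data)
text = data[skip:].decode(encoding)
```

The UTF-32 case would follow the same shape with a four-byte BOM; it is left out here to keep the sketch short.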
