apologies for possibly breaking the discussion thread
I lost the original message
thanks for the detailed test
this boils down to an endian issue with ast iconv and the BOM (byte order mark)
depending on how the charset is spelled: UTF-16 vs UTF16
this shows the difference
$ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF-16 | od -tx1
0000000 ff fe 78 00 ac 20 78 00 0a 00
0000012
$ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF16 | od -tx1
0000000 78 e2 82 ac 78 0a
0000006
when spelled UTF16 the BOM is omitted
with the BOM omitted ast iconv defaults to the native byte order
and gnu iconv defaults to big-endian -- thus the garbled results
when ast and gnu are mixed
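
a minimal sketch of why the mix garbles, using only POSIX printf, od and awk
rather than either iconv. le_bytes and units are hypothetical helper names,
and the bytes are the UTF-16 output shown above with the BOM stripped:

```shell
# emit "x" U+20AC "x" as BOM-less little-endian UTF-16 (octal escapes)
le_bytes() { printf 'x\000\254\040x\000'; }

# print stdin as 16-bit code units, pairing bytes in the given order
# usage: units le|be   (hypothetical helper, not part of ast or gnu)
units() {
    od -An -v -tx1 | awk -v order="$1" '
        { for (i = 1; i <= NF; i++) b[n++] = $i }
        END {
            for (i = 0; i + 1 < n; i += 2)
                out = out (i ? " " : "") (order == "le" ? b[i+1] b[i] : b[i] b[i+1])
            print out
        }'
}

le_bytes | units le    # prints: 0078 20ac 0078
le_bytes | units be    # prints: 7800 ac20 7800
```

the second reading is what a big-endian-by-default reader would see when
handed BOM-less little-endian output: same bytes, different characters.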
the reason for the UTF16 vs UTF-16 diff is a bug in the _ast_iconv() code
that confused UTF-16 and UCS-2
the standard supports { UTF-16 UTF-16BE UTF-16LE }
the BE and LE forms force big-endian or little-endian byte ordering
absent the BOM ( bytes ff fe for LE or fe ff for BE ) the UTF-16 form should default to BE on read
on write the LE form can be the default if a BOM is included (not sure if this
part is required by the standard, but gnu does this, and ast will in the next alpha)
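
the read rule can be sketched in a few lines of shell. this is only an
illustration of the rule as described here, not the _ast_iconv() code, and
utf16_order is a hypothetical name. it covers the plain UTF-16 form only,
since the UTF-16BE and UTF-16LE forms fix the order by name:

```shell
# report the byte order a reader should assume for a UTF-16 stream on stdin
utf16_order() {
    # first two bytes as hex with whitespace removed, e.g. "fffe"
    bom=$(od -An -tx1 -N2 | tr -d ' \n')
    case $bom in
    feff) echo BE ;;    # BOM stored big-endian
    fffe) echo LE ;;    # BOM stored little-endian
    *)    echo BE ;;    # no BOM: default to big-endian on read
    esac
}

printf '\377\376x\000' | utf16_order    # BOM ff fe
printf '\000x' | utf16_order            # no BOM
```

a real decoder would also consume the BOM before converting the rest of the
stream. that step is left out here.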
this will be fixed in the next alpha