apologies for possibly breaking the discussion thread
I lost the original message

thanks for the detailed test

this boils down to an endian issue with ast iconv and the BOM (byte order mark)
depending on how the charset is spelled: UTF-16 vs UTF16

this shows the difference
$ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF-16 | od -tx1        
0000000 ff fe 78 00 ac 20 78 00 0a 00
0000012
$ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF16 | od -tx1
0000000 78 e2 82 ac 78 0a
0000006

when spelled UTF16 the BOM is omitted
with the BOM omitted ast iconv defaults to the native byte order
and gnu iconv defaults to big-endian -- thus the garbled results
when ast and gnu are mixed
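
a minimal sketch of that mismatch, using python's codecs as a stand-in for the
two iconv implementations (an assumption, not the ast or gnu code) and assuming
the writer ran on a little-endian host: the same BOM-less bytes are read once
as little-endian and once as big-endian

```python
# the same BOM-less UTF-16 bytes decoded under each byte-order assumption
text = "x\u20acx\n"

# little-endian payload with no BOM, as a native-order writer on a
# little-endian host would emit it (assumption for this sketch)
payload = text.encode("utf-16-le")
print(" ".join(f"{b:02x}" for b in payload))

as_le = payload.decode("utf-16-le")  # reader assumes little-endian
as_be = payload.decode("utf-16-be")  # reader assumes big-endian

print(as_le == text)                 # the little-endian reader round-trips
print([hex(ord(c)) for c in as_be])  # the big-endian reader sees byte-swapped code points
```

the big-endian reader does not fail, it just yields different characters, which
is why the mix shows up as garbled text rather than as an error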

the reason for the UTF16 vs UTF-16 diff is a bug in the _ast_iconv() code
that confused UTF-16 and UCS-2

the standard supports { UTF-16 UTF-16BE UTF-16LE }
the BE and LE forms force big-endian or little-endian byte ordering
absent the BOM ( fffe or feff ) the UTF-16 form should default to BE on read
on write the LE form can be the default if a BOM is included (not sure if this
part is required by the standard, but gnu does this, and ast will in the next alpha)
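
the read and write rules above can be sketched roughly as follows; this is
python for illustration only, decode_utf16 and encode_utf16 are hypothetical
names, and it is not the _ast_iconv() implementation

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """decode plain UTF-16: honor a leading BOM, else assume big-endian"""
    if data.startswith(codecs.BOM_UTF16_LE):   # ff fe
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # fe ff
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")            # no BOM: default to BE on read


def encode_utf16(text: str, little_endian: bool = True) -> bytes:
    """encode plain UTF-16, always writing a BOM so the byte order is unambiguous"""
    if little_endian:
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


sample = "x\u20acx\n"
print(encode_utf16(sample).hex())                        # LE with BOM
print(decode_utf16(encode_utf16(sample)) == sample)      # BOM says LE
print(decode_utf16(sample.encode("utf-16-be")) == sample)  # no BOM, BE assumed
```

with little_endian left at its default the encoder emits the same byte sequence
as the UTF-16 od output earlier in this message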

this will be fixed in the next alpha

_______________________________________________
ast-users mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-users
