On Thu, 27 Sep 2012 02:43:42 +0200 Roland Mainz wrote:
> On Thu, Sep 27, 2012 at 1:27 AM, Glenn Fowler <[email protected]> wrote:
> >
> > apologies for possibly breaking the discussion thread
> > I lost the original message
> >
> > thanks for the detailed test
> >
> > this boils down to an endian issue with ast iconv and the BOM (byte order mark)
> > depending on how the charset is spelled: UTF-16 vs UTF16

> Mhhh... see http://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
> ... IMO it would be nice to have a warning in iconv --man about that. I
> recall writing tests for "read" (the "builtin_read.sh" module) and ran
> into hideous brain-eating issues with using AST iconv in EUC/ShiftJIS
> locales vs. UTF-16. Maybe even the same issue as this one...

> > this shows the difference
> > $ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF-16 | od -tx1
> > 0000000 ff fe 78 00 ac 20 78 00 0a 00
> > 0000012
> > $ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF16 | od -tx1
> > 0000000 78 e2 82 ac 78 0a
> > 0000006
> >
> > when spelled UTF16 the BOM is omitted
> > with the BOM omitted ast iconv defaults to the native byte order
> > and gnu iconv defaults to big-endian -- thus the garbled results
> > when ast and gnu are mixed

> Mhhh... I guess GNU iconv prefers network byte order... which is
> big-endian (fortunately... :-)

> > the reason for the UTF16 vs UTF-16 diff is a bug in the _ast_iconv() code
> > that confused UTF-16 and UCS-2

> UTF-16 is basically variable-width and can represent any Unicode
> character, even those outside the "BMP" (= Basic Multilingual Plane).
> UCS2 is the abomination which can only represent code points 0-65535
> (and things have been tacked-on in weird fashion to allow access to
> characters outside the BMP) ...

> > the standard supports { UTF-16 UTF-16BE UTF-16LE }
> > the BE and LE forms force big-endian or little-endian byte ordering
> > absent the BOM ( fffe or feff ) the UTF-16 form should default to BE on read
> > on write the LE form can be the default if a BOM is included (not sure if this
> > part is required by the standard, but gnu, and in the next alpha ast, do this)
> >
> > this will be fixed in the next alpha

the point of my post may have been missed
ast iconv behavior is incorrect
it will be fixed in the next alpha

> Erm... are you putting "iconv" into libcmd ?
> If "yes" then I'll contribute a "builtin_iconv.sh" test module ([1]) ... :-)

the move to libcmd will happen in a future alpha

_______________________________________________
ast-users mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-users
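
For anyone following the byte-order discussion, here is a minimal Python sketch of the read rule described in the thread: honour a leading BOM when there is one, otherwise fall back to a default byte order. This is not the ast or gnu iconv code. The function name `decode_utf16` and its `default` parameter are made up for illustration, and big-endian as the fallback is taken from the "should default to BE on read" remark above.

```python
import codecs


def decode_utf16(data: bytes, default: str = "big") -> str:
    """Decode UTF-16 bytes, honouring a leading BOM if present.

    Hypothetical helper for illustration only. Without a BOM, the
    `default` byte order ("big" or "little") decides how to read.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes fe ff
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes ff fe
        return data[2:].decode("utf-16-le")
    codec = "utf-16-be" if default == "big" else "utf-16-le"
    return data.decode(codec)


text = "x\u20acx\n"

# Little-endian with a BOM: same bytes as the first od dump in the thread.
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(le_with_bom.hex(" "))

# Big-endian without a BOM, as a writer that omits the BOM might emit.
be_no_bom = text.encode("utf-16-be")

# A reader that assumes big-endian recovers the text ...
print(decode_utf16(be_no_bom) == text)

# ... while a reader that assumes little-endian gets garbage instead.
print(decode_utf16(be_no_bom, default="little") == text)
```

The last two calls show why mixing tools with different no-BOM defaults garbles the data: the same bytes decode cleanly under one assumption and turn into unrelated code points under the other.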
