On Thu, Sep 27, 2012 at 1:27 AM, Glenn Fowler <[email protected]> wrote:
>
> apologies for possibly breaking the discussion thread
> I lost the original message
>
> thanks for the detailed test
>
> this boils down to an endian issue with ast iconv and the BOM (byte order
> mark)
> depending on how the charset is spelled: UTF-16 vs UTF16

Mhhh... see http://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
... IMO it would be nice to have a warning in iconv --man about that. I
recall writing tests for "read" (the "builtin_read.sh" module) and ran
into hideous brain-eating issues with using AST iconv in EUC/ShiftJIS
locales vs. UTF-16. Maybe even the same issue as this one...

> this shows the difference
> $ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF-16 | od -tx1
> 0000000 ff fe 78 00 ac 20 78 00 0a 00
> 0000012
> $ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF16 | od -tx1
> 0000000 78 e2 82 ac 78 0a
> 0000006
>
> when spelled UTF16 the BOM is omitted
> with the BOM omitted ast iconv defaults to the native byte order
> and gnu iconv defaults to big-endian -- thus the garbled results
> when ast and gnu are mixed

Mhhh... I guess GNU iconv prefers network byte order... which is
big-endian (fortunately... :-)

> the reason for the UTF16 vs UTF-16 diff is a bug in the _ast_iconv() code
> that confused UTF-16 and UCS-2

UTF-16 is basically variable-width and can represent any Unicode
character, even those outside the "BMP" (=Basic Multilingual Plane).
UCS2 is the abomination which can only represent code points 0-65535
(and things have been tacked on in weird fashion to allow access to
characters outside the BMP) ...

> the standard supports { UTF-16 UTF-16BE UTF-16LE }
> the BE and LE forms force big-endian or little-endian byte ordering
> absent the BOM ( fffe or feff ) the UTF-16 form should default to BE on read
> on write the LE form can be the default if a BOM is included (not sure if this
> part is required by the standard, but gnu, and in the next alpha ast, do this)
>
> this will be fixed in the next alpha

Erm... are you putting "iconv" into libcmd ? If "yes" then I'll
contribute a "builtin_iconv.sh" test module ([1]) ... :-)

[1]=("Yes", I know... I still owe David the test modules for
"types_nameref" and the "i18n_japanese" thing... and a test module for
the native libshell nval API etc. etc.
... ;-( )

----

Bye,
Roland

--
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
_______________________________________________
ast-users mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-users
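
P.S.: For illustration only, here is a minimal Python sketch of the BOM rules discussed above. It is not AST or GNU iconv code. The helper name decode_utf16 and its "default" parameter are made up for this example, and the big-endian fallback simply follows the "should default to BE on read" statement quoted above.

```python
import codecs


def decode_utf16(data: bytes, default: str = "be") -> str:
    """Decode UTF-16 bytes, honouring a leading BOM if one is present.

    Hypothetical helper: without a BOM it falls back to `default`
    ("be" or "le"), which is an assumption of this sketch.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes fe ff
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes ff fe
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-" + default)


text = "x\u20acx\n"  # same payload as the printf test: x, EURO SIGN, x, newline

# Little-endian with BOM, the byte sequence shown in the first od dump.
with_bom_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
# The same payload without a BOM, as a native-order writer on a
# little-endian machine would produce it.
no_bom_le = text.encode("utf-16-le")

print(with_bom_le.hex(" "))
print(repr(decode_utf16(with_bom_le)))            # BOM present: decodes cleanly
print(repr(decode_utf16(no_bom_le, "le")))        # reader assumes LE: fine
print(repr(decode_utf16(no_bom_le, "be")))        # reader assumes BE: garbled
```

The last call shows the mixed-tool case: BOM-less little-endian data read by a decoder that assumes big-endian gives unrelated code points rather than an error, which is why the mismatch is easy to miss.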
