On Thu, 27 Sep 2012 02:43:42 +0200 Roland Mainz wrote:
> On Thu, Sep 27, 2012 at 1:27 AM, Glenn Fowler <[email protected]> wrote:
> >
> > apologies for possibly breaking the discussion thread
> > I lost the original message
> >
> > thanks for the detailed test
> >
> > this boils down to an endian issue with ast iconv and the BOM (byte order mark)
> > depending on how the charset is spelled: UTF-16 vs UTF16

> Mhhh... see http://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
> ... IMO it would be nice to have a warning in iconv --man about that. I
> recall writing tests for "read" (the "builtin_read.sh" module) and ran
> into hideous brain-eating issues with using AST iconv in EUC/ShiftJIS
> locales vs. UTF-16. Maybe even the same issue as this one...

> > this shows the difference
> > $ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF-16 | od -tx1
> > 0000000 ff fe 78 00 ac 20 78 00 0a 00
> > 0000012
> > $ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF16 | od -tx1
> > 0000000 78 e2 82 ac 78 0a
> > 0000006
> >
> > when spelled UTF16 the BOM is omitted
> > with the BOM omitted ast iconv defaults to the native byte order
> > and gnu iconv defaults to big-endian -- thus the garbled results
> > when ast and gnu are mixed
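
The garbling described above can be reproduced without either iconv. The following is a minimal sketch in Python (not the tools from this thread); the variable names are illustrative only. It encodes the same test string in both byte orders without a BOM, then reads the little-endian bytes as if they were big-endian:

```python
# Same test string as in the iconv example: x, EURO SIGN, x, newline.
text = "x\u20acx\n"

# The explicit -LE / -BE codecs never write a BOM.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
print(le.hex(" "))  # 78 00 ac 20 78 00 0a 00
print(be.hex(" "))  # 00 78 20 ac 00 78 00 0a

# A writer that used little-endian and a reader that assumes big-endian
# disagree on every 16-bit unit, so every character comes out wrong.
garbled = le.decode("utf-16-be")
print([hex(ord(c)) for c in garbled])  # ['0x7800', '0xac20', '0x7800', '0xa00']
```

The little-endian bytes match the od output above minus the leading ff fe, which is the BOM.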

> Mhhh... I guess GNU iconv prefers network byte order... which is
> big-endian (fortunately... :-)

> > the reason for the UTF16 vs UTF-16 diff is a bug in the _ast_iconv() code
> > that confused UTF-16 and UCS-2

> UTF-16 is basically variable-width and can represent any Unicode
> character, even those outside the "BMP" (=Basic Multilingual Plane).
> UCS2 is the abomination which can only represent code points 0-65535
> (and things have been tacked-on in weird fashion to allow access to
> characters outside the BMP) ...
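
To make the variable-width point concrete: UTF-16 reaches code points above 65535 by combining two 16-bit units into a surrogate pair. The sketch below is in Python, and `surrogate_pair` is a hypothetical helper written for illustration, not part of any library discussed here:

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high/low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 into the low one.
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

bmp = "\u20ac"         # EURO SIGN, inside the BMP: one 16-bit unit
astral = "\U0001F600"  # outside the BMP: two 16-bit units

print(bmp.encode("utf-16-be").hex(" "))          # 20 ac
print(astral.encode("utf-16-be").hex(" "))       # d8 3d de 00
print([hex(u) for u in surrogate_pair(0x1F600)]) # ['0xd83d', '0xde00']
```

A strict UCS-2 consumer treats each of those two units as a separate 16-bit value, which is why confusing the two encodings only shows up once text leaves the BMP.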

> > the standard supports { UTF-16 UTF-16BE UTF-16LE }
> > the BE and LE forms force big-endian or little-endian byte ordering
> > absent the BOM ( fffe or feff ) the UTF-16 form should default to BE on read
> > on write the LE form can be the default if a BOM is included (not sure if this
> > part is required by the standard, but gnu, and in the next alpha ast, do this)
> >
> > this will be fixed in the next alpha
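
The read-side rule quoted above (honor a BOM if present, otherwise assume big-endian) can be sketched as follows. This is a Python illustration under that stated rule, not the ast or GNU iconv implementation; `decode_utf16` is a hypothetical name:

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode plain "UTF-16" input: use the BOM if present, else assume BE."""
    if data.startswith(codecs.BOM_UTF16_BE):   # fe ff
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # ff fe
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to big-endian rather than the host byte order.
    return data.decode("utf-16-be")

with_le_bom = bytes.fromhex("ff fe 78 00 ac 20 78 00 0a 00")
without_bom = bytes.fromhex("00 78 20 ac 00 78 00 0a")
print(decode_utf16(with_le_bom) == decode_utf16(without_bom))  # True
```

With this rule the result no longer depends on the machine doing the reading, which is what goes wrong when a BOM-less stream is interpreted in native byte order.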

the point of my post may have been missed
ast iconv behavior is incorrect
it will be fixed in the next alpha

> Erm... are you putting "iconv" into libcmd ? If "yes" then I'll
> contribute a "builtin_iconv.sh" test module ([1]) ... :-)

the move to libcmd will happen in a future alpha

_______________________________________________
ast-users mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-users