On Thu, Sep 27, 2012 at 1:27 AM, Glenn Fowler <[email protected]> wrote:
>
> apologies for possibly breaking the discussion thread
> I lost the original message
>
> thanks for the detailed test
>
> this boils down to an endian issue with ast iconv and the BOM (byte order
> mark) depending on how the charset is spelled: UTF-16 vs UTF16

Mhhh... see http://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
... IMO it would be nice to have a warning in iconv --man about that. I
recall writing tests for "read" (the "builtin_read.sh" module) and running
into hideous brain-eating issues when using AST iconv in EUC/ShiftJIS
locales vs. UTF-16. Maybe it was even the same issue as this one...

> this shows the difference
> $ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF-16 | od -tx1
> 0000000 ff fe 78 00 ac 20 78 00 0a 00
> 0000012
> $ printf 'x\u[20ac]x\n' | iconv -f UTF-8 -t UTF16 | od -tx1
> 0000000 78 e2 82 ac 78 0a
> 0000006
>
> when spelled UTF16 the BOM is omitted
> with the BOM omitted ast iconv defaults to the native byte order
> and gnu iconv defaults to big-endian -- thus the garbled results
> when ast and gnu are mixed

Mhhh... I guess GNU iconv prefers network byte order... which is
big-endian (fortunately... :-)
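
A minimal sketch of how to sidestep the default entirely by naming the
byte order explicitly. It assumes an iconv that accepts the UTF-16BE and
UTF-16LE names; hexdump_as is a hypothetical helper name, and the euro
sign is written as octal escapes so the example does not depend on the
ksh93 \u[...] syntax:

```shell
# Sketch: hex-dump "x<EURO>x\n" converted from UTF-8 to the charset in $1.
# hexdump_as is a made-up helper name, not part of any iconv package.
hexdump_as() {
    # \342\202\254 is the UTF-8 encoding of U+20AC as octal escapes
    printf 'x\342\202\254x\n' | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

hexdump_as UTF-16BE   # 007820ac0078000a
hexdump_as UTF-16LE   # 7800ac2078000a00
```

With the BE or LE form no BOM is written, so the byte order is carried
only by the charset name and both sides of a pipe have to agree on it.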

> the reason for the UTF16 vs UTF-16 diff is a bug in the _ast_iconv() code
> that confused UTF-16 and UCS-2

UTF-16 is basically variable-width and can represent any Unicode
character, even those outside the "BMP" (= Basic Multilingual Plane).
UCS-2 is the abomination which can only represent code points 0-65535
(the surrogate mechanism which allows access to characters outside the
BMP was tacked on later in a weird fashion, and that is UTF-16) ...
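
To illustrate the variable-width part, here is a sketch of the surrogate
pair arithmetic for a code point above U+FFFF. The function name
surrogates is only an illustration; the constants 0x10000, 0xD800 and
0xDC00 come from the UTF-16 definition:

```shell
# Sketch: print the UTF-16 code units for the code point given in $1.
# surrogates is a made-up helper name for illustration only.
surrogates() {
    cp=$(( $1 ))
    if [ "$cp" -lt 65536 ]; then
        # inside the BMP: a single 16-bit code unit, same as UCS-2
        printf '%04x\n' "$cp"
    else
        v=$(( cp - 65536 ))                 # subtract 0x10000, 20 bits remain
        hi=$(( 55296 + (v >> 10) ))         # 0xD800 + upper 10 bits
        lo=$(( 56320 + (v & 1023) ))        # 0xDC00 + lower 10 bits
        printf '%04x %04x\n' "$hi" "$lo"
    fi
}

surrogates 0x20AC    # 20ac
surrogates 0x1F600   # d83d de00
```

A pure UCS-2 converter has no such second branch, which is why mixing up
the two encodings matters for anything outside the BMP.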

> the standard supports { UTF-16 UTF-16BE UTF-16LE }
> the BE and LE forms force big-endian or little-endian byte ordering
> absent the BOM ( fffe or feff ) the UTF-16 form should default to BE on read
> on write the LE form can be the default if a BOM is included (not sure if this
> part is required by the standard, but gnu, and in the next alpha ast, do this)
>
> this will be fixed in the next alpha
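
The read-side rule described above could be sketched like this: look at
the first two bytes, and fall back to big-endian when no BOM is found.
utf16_order is a hypothetical name, and this is not the code used in
_ast_iconv(), only an illustration of the rule:

```shell
# Sketch: report the byte order of a plain "UTF-16" stream on stdin.
# utf16_order is a made-up helper name for illustration only.
utf16_order() {
    # read only the first two bytes and turn them into a hex string
    bom=$(od -An -tx1 -N2 | tr -d ' \n')
    case $bom in
        feff) echo BE ;;                # BOM bytes FE FF
        fffe) echo LE ;;                # BOM bytes FF FE
        *)    echo 'BE (no BOM)' ;;     # absent a BOM, default to big-endian
    esac
}

printf '\377\376x\000' | utf16_order   # LE
printf '\000x' | utf16_order           # BE (no BOM)
```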

Erm... are you putting "iconv" into libcmd ? If "yes" then I'll
contribute a "builtin_iconv.sh" test module ([1]) ... :-)

[1]=("Yes", I know... I still owe David the test modules for
"types_nameref" and the "i18n_japanese" thing... and a test module for
the native libshell nval API etc. etc. ... ;-( )

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
_______________________________________________
ast-users mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-users
