Re: [ast-developers] Bug: Incorrect output of unicode chars in structured variables

Bernd Eggink Sun, 28 Oct 2007 15:07:36 -0800

Glenn Fowler wrote:

On Sun, 28 Oct 2007 12:17:39 +0100 Bernd Eggink wrote:

Both $'\xe4' and $'\303\244' render as a-umlaut with LANG=de_DE.UTF-8.
Only $'\303\244' does so with LANG=C.

I'm not quite convinced that this isn't a ksh issue...
Thanks anyway!
Confused,
Bernd


confused here too
you and I aren't getting off that easy

part of the problem is that I'm usually in LANG=C
and am oblivious to many of the locale subtleties
so the following analysis could be off base
please jump in and correct any errors in the logic

I do know this much about utf-8 encoding
the leftmost 1-bits in each utf-8 byte specify the number
of current and remaining bytes to make up the encoded character


Actually it's a bit more complicated, see here:

 http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
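
To make the lead-byte rule concrete, here is a minimal sketch in portable shell. The function name utf8_len is made up for this illustration and is not part of ksh. It only inspects the leading bits of a single byte and does not reject overlong or out-of-range forms, so treat it as a rough aid rather than a validator.

```shell
# utf8_len BYTE (hypothetical helper): print the sequence length implied
# by a lead byte, 0 for a continuation byte, or -1 for an invalid lead byte.
utf8_len() {
    b=$(( $1 ))
    if   [ $(( b & 0x80 )) -eq 0 ];   then echo 1   # 0xxxxxxx: plain ASCII
    elif [ $(( b & 0xC0 )) -eq 128 ]; then echo 0   # 10xxxxxx: continuation byte
    elif [ $(( b & 0xE0 )) -eq 192 ]; then echo 2   # 110xxxxx: starts 2 bytes
    elif [ $(( b & 0xF0 )) -eq 224 ]; then echo 3   # 1110xxxx: starts 3 bytes
    elif [ $(( b & 0xF8 )) -eq 240 ]; then echo 4   # 11110xxx: starts 4 bytes
    else echo -1                                    # 11111xxx: never valid
    fi
}
```

By this rule 0xc3 announces a two-byte sequence and 0xa4 is a continuation byte, whereas a lone 0xe4 announces a three-byte sequence that never arrives. That would be consistent with $'\xe4' not being valid UTF-8 on its own.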

is there a tty setting that says "if it's not utf-8, try 8-bit ascii"?


Not that I know of, but the tty appears to behave so.

As to the original problem: I tried some other sequences, such as

$ a=(name='1ä')         # 1 a-umlaut
$ print ${a.name}
1ä                      # correct: 1 a-umlaut
$ print $a
( name=$'1\344' )       # Huh?
$ print "$a"
(
        name=$'1\344'
)

0344 (0xe4) is the Unicode code point of a-umlaut; the correct UTF-8 encoding would be the two bytes 0xc3 0xa4 (octal \303\244). The results suggest that, under certain circumstances, the output encoding of structured values is incorrect or simply missing.
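
The arithmetic behind that claim can be checked with a short sketch in portable shell. The function name utf8_2byte is hypothetical and only covers code points U+0080 through U+07FF, which is all that is needed for a-umlaut.

```shell
# utf8_2byte CP (hypothetical helper): print the two-byte UTF-8 encoding
# of a code point in U+0080..U+07FF as space-separated hex bytes.
utf8_2byte() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ] || [ "$cp" -gt 2047 ]; then
        echo "code point outside the two-byte range" >&2
        return 1
    fi
    b1=$(( 0xC0 | (cp >> 6) ))      # 110xxxxx carries the upper 5 bits
    b2=$(( 0x80 | (cp & 0x3F) ))    # 10xxxxxx carries the lower 6 bits
    printf '%02x %02x\n' "$b1" "$b2"
}

utf8_2byte 0xE4     # a-umlaut, U+00E4; prints: c3 a4
```

So the value printed for the compound variable, \344, looks like the raw code point written out as a single byte, not the two bytes that a UTF-8 locale expects.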


Regards,
Bernd

--
Bernd Eggink
[EMAIL PROTECTED]
http://sudrala.de