Re: [ast-developers] Bug: Incorrect output of unicode chars in structured variables

Glenn Fowler Sun, 28 Oct 2007 08:34:12 -0800

On Sun, 28 Oct 2007 12:17:39 +0100 Bernd Eggink wrote:
> Both $'\xe4' and $'\303\244' render as a-umlaut with LANG=de_DE.UTF-8.
> Only $'\303\244' does so with LANG=C.


> I'm not quite convinced that this isn't a ksh issue...
> Thanks anyway!
> Confused,
> Bernd

confused here too
you and I aren't getting off that easy

part of the problem is that I'm usually in LANG=C
and am oblivious to many of the locale subtleties
so the following analysis could be off base
please jump in and correct any errors in the logic

I do know this much about utf-8 encoding
the count of leftmost 1-bits in the lead byte gives the total number
of bytes in the encoded character
each continuation byte has a single leading 1-bit (10xxxxxx)

for the one I specified:
$ printf $'%..2u ' 0303 0244; print
11000011 10100100
2        1

for the one you specified, I'm guessing the 8-bit iso-8859-1 (latin-1) a-umlaut,
$ printf $'%..2u ' 0xe4; print
11100100
3

which for a utf-8 encoded app means "this utf-8 encoding takes up 3 bytes"
but the app is only presented with 1 byte
so there is an encoding error and all bets are off
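
the bit counting above can be sketched as a small shell function
this is only an illustration, assuming a POSIX-style shell with octal and hex
arithmetic constants; utf8_byte_kind is a hypothetical helper name, not
something ksh or AST provides

```shell
# Classify one byte by its leading 1-bits, roughly as a UTF-8 decoder would.
# utf8_byte_kind is a hypothetical name used only for this sketch.
utf8_byte_kind() {
    b=$(( $1 ))                         # accepts decimal, 0octal or 0xhex
    if   [ $(( b & 0x80 )) -eq 0 ];   then echo "ascii (1 byte)"   # 0xxxxxxx
    elif [ $(( b & 0xC0 )) -eq 128 ]; then echo "continuation"     # 10xxxxxx
    elif [ $(( b & 0xE0 )) -eq 192 ]; then echo "lead of 2"        # 110xxxxx
    elif [ $(( b & 0xF0 )) -eq 224 ]; then echo "lead of 3"        # 1110xxxx
    elif [ $(( b & 0xF8 )) -eq 240 ]; then echo "lead of 4"        # 11110xxx
    else echo "invalid"
    fi
}

utf8_byte_kind 0303    # lead of 2
utf8_byte_kind 0244    # continuation
utf8_byte_kind 0xe4    # lead of 3
```

so 0303 0244 is a complete 2-byte sequence, while a lone 0xe4 announces
3 bytes and delivers 1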

in particular I see '\xe4' as space, not a-umlaut
is there a tty setting that says "if it's not utf-8, try 8-bit iso-8859-1"?
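
this does not answer the tty question, but the same fallback idea can be
sketched as a filter
it assumes the iconv and mktemp utilities are available and that the 8-bit
encoding really is iso-8859-1, which is a guess; to_utf8 is a hypothetical
helper name

```shell
# Pass stdin through if it is valid UTF-8, otherwise reinterpret it as
# ISO-8859-1 and convert. to_utf8 is a hypothetical name for this sketch.
to_utf8() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"                        # buffer stdin so it can be read twice
    if iconv -f UTF-8 -t UTF-8 "$tmp" >/dev/null 2>&1; then
        cat "$tmp"                      # already valid UTF-8, leave untouched
    else
        iconv -f ISO-8859-1 -t UTF-8 "$tmp"   # assumed 8-bit fallback
    fi
    rm -f "$tmp"
}

printf '\344\n'     | to_utf8 | od -An -to1    # 303 244 012
printf '\303\244\n' | to_utf8 | od -An -to1    # 303 244 012
```

either input ends up as the 2-byte sequence 0303 0244, which is what a
utf-8 terminal expects for a-umlaut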

--Glenn

