Re: Character (or byte?) escapes under utf8 pragma

Juerd Waalboer Thu, 11 Mar 2010 04:11:25 -0800

Michael Ludwig wrote 2010-03-10 10:34 (+0100):
> Okay. Let me try to see if I have understood correctly. Without the utf8
> pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored as a sequence
> of two bytes in my source code will be stored internally as a sequence
> of 12 integers. With the utf8 pragma in scope, only 11 integers.


"so\xa0ein\xa0Käse" must be stored as either:

    l1: 73 6f a0 65 69 6e a0 4b e4 73 65 (UTF8 flag off)

or:

    u8: 73 6f c2-a0 65 69 6e c2-a0 4b c3-a4 73 65 (UTF8 flag on)

Both strings should be semantically equal, and have 11 characters, each
of which has an integer ordinal value.
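
The equivalence of the two representations can be checked outside Perl as
well. The following is only a sketch in Python, not a model of Perl's
internals: the variable names are made up for illustration, and the string
is written with escapes so that the encoding of the source file plays no
role.

```python
# The same 11-character string, written purely with escapes.
text = "so\u00a0ein\u00a0K\u00e4se"

l1 = text.encode("latin-1")  # one byte per character
u8 = text.encode("utf-8")    # two bytes for each character above 0x7f

print(len(text))             # number of characters
print(l1.hex(" "))           # the "l1" byte sequence
print(u8.hex(" "))           # the "u8" byte sequence

# Both byte sequences decode back to the same characters,
# and each character has a single integer ordinal value.
same = l1.decode("latin-1") == u8.decode("utf-8") == text
ordinals = [ord(c) for c in text]
print(same, [hex(o) for o in ordinals])
```

The byte counts differ (11 versus 14), but the character count and the
ordinals do not, which is the sense in which the two strings are equal.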

What happens is the following:

    73 6f a0 65 69 6e a0 4b c3-a4 73 65 (UTF8 flag on)
          l1          l1     u8

This is wrong. It is a bug.
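
To see why the mixed sequence cannot be right under either reading, here
is a small Python sketch. It only looks at the bytes themselves and makes
no claim about what Perl does with them internally; the variable names are
hypothetical.

```python
# The mixed byte sequence: latin-1 bytes for the \xa0 escapes,
# UTF-8 bytes for the literal a-umlaut.
mixed = bytes.fromhex("73 6f a0 65 69 6e a0 4b c3 a4 73 65")

# Read as UTF-8, it is malformed: a lone a0 is a continuation
# byte without a lead byte.
try:
    mixed.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Read as latin-1, every byte becomes one character, so the
# a-umlaut turns into two characters (c3 and a4).
as_latin1 = mixed.decode("latin-1")

print(valid_utf8, len(as_latin1))
```

Read as latin-1 the string has 12 characters instead of 11, and read as
UTF-8 it is not well-formed at all.
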
-- 
Met vriendelijke groet, // Kind regards, // Korajn salutojn,

Juerd Waalboer  <ju...@tnx.nl>
TNX
