encoding vs charset

Leopold Toetsch Tue, 15 Jul 2008 14:17:38 -0700

Hi,

I just saw this exchange (too late) at #parrotsketch:


  21:52 < NotFound> So unicode:"\xab" and utf8:unicode:"\xab" is also the same 
result?

In my opinion (and, AFAIK, still in the implementation), the encoding bit of 
PIR says how the possibly escaped bytes specify the codepoint in the 
_source code_. That codepoint then belongs to some charset. Alas, the above 
example is illegal.

The source encoding of that mentioned file t/op/stringu.t is utf8:

:set fenc?
  fileencoding=utf-8

pasm_output_is( <<'CODE', <<OUTPUT, "UTF8 literals" );
    set S0, utf8:unicode:"«"

and ...

pasm_output_is( <<'CODE', <<OUTPUT, "UTF8 literals" );
    set S0, utf8:unicode:"\xc2\xab"

this is valid UTF8 encoding too, as there is no collision between escaped and 
non-escaped UTF8 chars.
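
To illustrate why the two test cases above are equivalent, here is a small 
sketch in plain Python. It only demonstrates the byte-level UTF8 fact; it is 
not Parrot code and assumes nothing about how Parrot parses string literals.

```python
# Illustration only: plain Python, not Parrot's string implementation.

# Bytes that a UTF-8 encoded source file holds for the character « (U+00AB).
literal_bytes = "\u00ab".encode("utf-8")

# The same bytes written out as escapes, as in the second test case.
escaped_bytes = b"\xc2\xab"

# Both spellings denote one byte sequence, so there is no collision.
same_bytes = literal_bytes == escaped_bytes

# Decoding that sequence as UTF-8 yields a single codepoint.
codepoints = [ord(ch) for ch in escaped_bytes.decode("utf-8")]
print(same_bytes, [hex(cp) for cp in codepoints])
```

Either way the source spells the same two bytes, and the utf8 encoding maps 
them to the one codepoint U+00AB in charset unicode.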

unicode:"\xab" is illegal, as there is no encoding in unicode that would 
make this a codepoint (all the more so since the default encoding of charset 
unicode is utf8). Or IOW, if this were valid, then the escaped char syntax 
would be ambiguous.

21:51 < pmichaud> so   unicode:"«"   and unicode:"\xab"  would produce exactly 
the same result.
21:51 < pmichaud> even down to being the same .pbc output.
21:51 < allison> pmichaud: exactly

The former is a valid char in a UTF8- or iso-8859-1-encoded source file, and 
only there, while the latter is a single byte that is only part of a UTF8 
char and invalid on its own. How would you interpret unicode:"\xab\x65" then?
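
To make the ambiguity concrete, here is a small Python sketch that tries both 
readings of those bytes. Python only stands in for illustration; the helper 
try_decode is a hypothetical name and says nothing about what Parrot does.

```python
# Illustration only: plain Python, not Parrot's string implementation.

def try_decode(raw: bytes, encoding: str):
    """Return the codepoints as hex strings, or None if raw is not valid."""
    try:
        return [hex(ord(ch)) for ch in raw.decode(encoding)]
    except UnicodeDecodeError:
        return None

# 0xAB alone is a UTF-8 continuation byte, so it is not a valid sequence.
utf8_single = try_decode(b"\xab", "utf-8")

# Read as iso-8859-1, the same byte is the codepoint U+00AB.
latin1_single = try_decode(b"\xab", "iso-8859-1")

# The two-byte example is invalid as UTF-8 but fine as iso-8859-1.
utf8_pair = try_decode(b"\xab\x65", "utf-8")
latin1_pair = try_decode(b"\xab\x65", "iso-8859-1")

print(utf8_single, latin1_single, utf8_pair, latin1_pair)
```

The bytes only get a meaning once an encoding is fixed, which is why the 
escapes have to be read in the declared source encoding.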

I think there is still some confusion between the encoding of the source code 
(together with the desired meaning in the charset) and the internal encoding 
of parrot, which might be UCS2 or anything.

my 2 ¢
leo
