Hi,
I just saw this (too late) on #parrotsketch:
21:52 < NotFound> So unicode:"\xab" and utf8:unicode:"\xab" is also the same
result?
In my opinion (and, AFAIK, still in the implementation), the encoding part of
a PIR string literal says how the possibly escaped bytes in the _source code_
specify the codepoint. That codepoint then belongs to some charset. So the
above example is illegal.
The source encoding of the file mentioned, t/op/stringu.t, is utf8:
:set fenc?
fileencoding=utf-8
pasm_output_is( <<'CODE', <<OUTPUT, "UTF8 literals" );
set S0, utf8:unicode:"«"
and ...
pasm_output_is( <<'CODE', <<OUTPUT, "UTF8 literals" );
set S0, utf8:unicode:"\xc2\xab"
This is valid UTF8 encoding too, as there is no collision between escaped and
non-escaped UTF8 chars.
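
To make the byte-level claim concrete, here is a minimal sketch in Python
(not Parrot; Python's codec only stands in for the UTF-8 decoding step, and
the variable names are made up for illustration). It checks that the literal
« in a UTF-8 source file and the escaped form \xc2\xab are the same two
bytes and decode to the single codepoint U+00AB:

```python
# Illustration only: Python's UTF-8 codec stands in for the decoding step.
literal_bytes = "\u00ab".encode("utf-8")  # bytes a UTF-8 source file stores for «
escaped_bytes = b"\xc2\xab"               # the same bytes written with escapes

# Both spellings are byte-for-byte identical.
same_bytes = literal_bytes == escaped_bytes

# Decoding the two bytes yields exactly one codepoint, U+00AB.
decoded = escaped_bytes.decode("utf-8")
codepoint = ord(decoded)

print(same_bytes, len(decoded), hex(codepoint))
```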
unicode:"\xab" is illegal, as there is no encoding of unicode in which this
would be a codepoint (all the more since the default encoding of charset
unicode is utf8). Or, IOW, if this were valid, then the escaped char syntax
would be ambiguous.
21:51 < pmichaud> so unicode:"«" and unicode:"\xab" would produce exactly
the same result.
21:51 < pmichaud> even down to being the same .pbc output.
21:51 < allison> pmichaud: exactly
The former is a valid char in a UTF8/iso-8859-1 encoded source file, and only
there, while the latter is a single byte that is just part of a UTF8 char and
invalid on its own. How would you interpret unicode:"\xab\x65" then?
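
A small Python sketch of that ambiguity (again only an illustration, not
Parrot code; try_utf8 is a hypothetical helper). Read as UTF-8, neither
\xab nor \xab\x65 is a valid byte sequence; read as iso-8859-1, the same two
bytes are « followed by e:

```python
def try_utf8(raw: bytes):
    """Return the decoded string, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

# 0xAB is a UTF-8 continuation byte; without a lead byte it is invalid.
lone = try_utf8(b"\xab")
pair = try_utf8(b"\xab\x65")

# The same bytes are perfectly valid if the encoding is iso-8859-1.
as_latin1 = b"\xab\x65".decode("iso-8859-1")

print(lone, pair, [hex(ord(c)) for c in as_latin1])
```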
I think that there is still some confusion between the encoding of the source
code (with the desired meaning in the charset) and the internal encoding of
parrot, which might be UCS2 or anything.
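
The distinction can be sketched in Python as follows (an illustration under
assumptions: UTF-16-BE stands in for UCS2, which is identical for codepoints
in the Basic Multilingual Plane, and nothing here reflects how parrot
actually stores strings). The source bytes and the internal bytes differ,
while the codepoint stays the same:

```python
# Source side: the bytes as they appear in a UTF-8 encoded source file.
source_bytes = b"\xc2\xab"
text = source_bytes.decode("utf-8")       # meaning: the codepoint U+00AB

# Internal side: a possible fixed-width representation (UTF-16-BE as a
# stand-in for UCS2; an assumption made for this sketch).
internal_bytes = text.encode("utf-16-be")

# Different byte sequences, same codepoint after a round trip.
roundtrip = internal_bytes.decode("utf-16-be")

print(source_bytes.hex(), internal_bytes.hex(), hex(ord(roundtrip)))
```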
my 2 ¢
leo