Michael Ludwig wrote 2010-03-08 15:55 (+0100):
> > Perl does not distinguish between bytes and characters. (...) You
> > cannot tell what kind of data they contain just by looking at them
> > and the UTF8 flag doesn’t tell you either.
> Okay. But unless I'm completely misled, you can tell whether a
> string is supposed to contain characters (<- Encode::decode) or
> bytes (<- Encode::encode)

The result of decode is a character string.

The result of encode is a byte string.

However, apart from looking at the source code and deducing the
intentions of the programmer, there is no way to tell whether a given
string is meant as a character or byte string, simply because there is
no technical representation of this intent in the string or its
metadata.
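
To illustrate, here is a minimal sketch. The sample data (an "é" encoded
as UTF-8) and the variable names are mine, chosen only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket: "é" encoded as UTF-8.
my $bytes = "\xC3\xA9";

# decode: byte string in, character string out.
my $chars = decode('UTF-8', $bytes);

# encode: character string in, byte string out.
my $again = encode('UTF-8', $chars);

printf "chars: length %d\n", length $chars;   # one character, U+00E9
printf "bytes: length %d\n", length $again;   # two bytes, C3 A9

# Nothing in the value records the intent: a string written as the two
# characters U+00C3 U+00A9 compares equal to the two-byte string.
my $two_chars = "\x{C3}\x{A9}";
print "indistinguishable\n" if $two_chars eq $again;
```

Only the programmer knows whether $two_chars was meant as text or as
encoded data.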

Note that "characters" are the general case: a string is made of
characters. When every character value fits in a single byte, the string
can be used as a byte string.

> > This is definitely a bug.
> Good. It looked like one to me. Thanks for logging it with the
> Perl maintainers.

This bug forces us to look at the internal encoding and flags to come to
the conclusion that it is indeed a bug. Don't mistake this for a sign
that actual code should ever look at the internal encoding or flags.
Even if you work around the bug, make sure that you don't make anything
conditional on the current internal representation of the string.

Instead, coerce it to whatever you need by using utf8::downgrade or
utf8::upgrade. In your specific case, concatenation of two separate
parts is probably the sanest thing to do.
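
A rough sketch of what that coercion looks like, with made-up sample
strings. It is not a workaround for the bug itself; it only shows that
both functions change the internal representation and never the value:

```perl
use strict;
use warnings;

my $str = "caf\xE9";        # four characters, all below U+0100

my $copy = $str;
utf8::upgrade($copy);       # force the UTF-8 internal representation
print "same after upgrade\n" if $copy eq $str;

utf8::downgrade($copy);     # force one byte per character internally
print "same after downgrade\n" if $copy eq $str;

# A character above U+00FF cannot be downgraded. With a true second
# argument, utf8::downgrade returns false instead of dying.
my $wide = "\x{263A}";
my $ok   = utf8::downgrade($wide, 1);
print "cannot downgrade\n" unless $ok;
```

Both functions are available without "use utf8".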

> >> Am I mistaken in my expectation that while "\xa0" should be
> >> a byte, "\x{a0}" and "\x{00a0}" should be characters?

Yes, you are mistaken. These three escapes are supposed to be exactly
the same. They create a U+00A0 character, which happens to be perfectly
usable as the A0 byte when used as such, in a string that doesn't
contain any character greater than U+00FF.
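
A quick check, as a small sketch (the @forms array is just a name I
picked):

```perl
use strict;
use warnings;

# The three spellings from the question.
my @forms = ("\xa0", "\x{a0}", "\x{00a0}");

for my $s (@forms) {
    printf "length %d, ord 0x%02X\n", length $s, ord $s;
}

print "all equal\n"
    if $forms[0] eq $forms[1] && $forms[1] eq $forms[2];
```

Each form should report a length of 1 and an ordinal of 0xA0.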

> >> [perlre:]
> >> Unicode characters in the range of 128-255 use two hexadecimal
> >> digits with braces: \x{ab}. Note that this is different than
> >> \xab, which is just a hexadecimal byte with no Unicode
> >> significance.
> The documentation I referred to is outdated. Sorry for that.

Indeed, this documentation is wrong. Current documentation, as of Perl
version 5.8.9 (December 2008), no longer has this paragraph.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <##...@juerd.nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sa...@convolution.nl>
