Re: utf8::valid and \x14_000 - \x1F_0000

Juerd Waalboer Wed, 12 Mar 2008 09:53:40 -0700

Chris Hall skribis 2008-03-12 13:20 (+0000):
> >> OK.  In the meantime IMHO chr(n) should be handling utf8 and has no
> >> business worrying about things which UTF-8 or UCS think aren't
> >> characters.
> >It should do Unicode, not any specific byte encoding, like UTF-?8.
> IMHO chr(n) should do characters, which may be interpreted as per
> Unicode, but may not.
> When I said utf8 I was following the (sloppy) convention that utf8 means
> how Perl handles characters in strings...


I'm working hard to break this convention. I've changed a lot of Perl
documentation, and the result was released with Perl 5.10.

If in any place in Perl's official documentation, it still reads UTF-8
or UTF8 for *characters in text strings*, it's wrong. Let me know and I
will fix it :)

>   b. in a Perl string, characters are held in a UTF-8 like form.

I'd say *inside* a Perl string. This is the C implementation, but a Perl
programmer should not have to know the specific *internal* encoding of a
Perl string.

Likewise, in Perl you don't have to know whether your number is
internally encoded as a long integer or a double.
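
A small sketch of that point, assuming nothing beyond core Perl: the same
one-character string is held in two different internal forms, and Perl code
that only looks at characters cannot tell them apart. The variable names are
made up for illustration.

```perl
use strict;
use warnings;

# The same character (U+00E9) in two different internal representations.
my $native   = "\xE9";        # usually stored internally as a single byte
my $upgraded = $native;
utf8::upgrade($upgraded);     # force the UTF-8-like internal form

# At the Perl level the two strings are indistinguishable.
printf "equal:  %s\n", $native eq $upgraded ? "yes" : "no";
printf "length: %d and %d\n", length($native), length($upgraded);

# Only by peeking at the internals can you see a difference.
printf "internal flag: %d and %d\n",
    (utf8::is_utf8($native)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);
```

utf8::is_utf8 is used here only to show that the internals differ; ordinary
code should have no reason to call it.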

>      Where UTF-8 (upper case, with hyphen) means the RFC 3629 &
>      Unicode Consortium defined byte-wise encoding.

That's the theory, but in practice implementations often don't follow the
spec entirely.

>      This form is referred to as utf8 (lower case, no hyphen).

Yes, but note that encoding names in Perl are case insensitive. I tend
to call it UTF8 sometimes.

>      There is really no need to discuss this, except in the context of
>      messing around in guts of Perl.

Exactly.

>      String literals are represented by UCS code points.  Which
>      reinforces the feeling that characters in Perl are Unicode.

Yes!

>      'C' uses 'wide' to refer to characters that may have values
>      > 255.  IMHO it's a shame that Perl did not follow this.

It does in some places, most notably warnings about "wide characters".

>   d. when exchanging character data with other systems one needs to
>      deal with character set and encoding issues.

Not just other systems. All I/O is done in bytes, even when you are talking
to yourself, for example after a fork.
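
Roughly, and only as a sketch using the core Encode module: whatever goes
over a pipe, socket or file is bytes, so the writer encodes and the reader
decodes, even if both ends are the same program. The variable names are
illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A text string: six characters, two of them outside ASCII.
my $text = "caf\x{E9} \x{263A}";

# What actually crosses the pipe or socket is a byte string.
my $bytes = encode('UTF-8', $text);

# The reading side has to turn those bytes back into characters.
my $back = decode('UTF-8', $bytes);

printf "%d characters, %d bytes\n", length($text), length($bytes);
printf "round trip ok: %s\n", $back eq $text ? "yes" : "no";
```

This prints "6 characters, 9 bytes" followed by "round trip ok: yes".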

>             "Isolated surrogate code units have no interpretation on
>              their own."
> (...)
>            Clearly these are illegal in UTF-8.

They have no interpretation, but the quoted text doesn't say they are
illegal either.

Compare it with the undefined behavior of multiple ++ in a single
expression. There's no specification of what should happen, but it's not
illegal to do it.

>             "Applications are free to use any of these noncharacter code
>              points internally but should never attempt to exchange
>              them.

I don't think it's Perl's job to prevent exchange, simply because the
exchange could be internal: between processes of the same program.

> I'm puzzled as to why 'UTF-8' (strict) doesn't treat U+FFFE (and
> friends) in the same way as U+FFFF (and friends).

My gut says it's out of ignorance of the "rules", and certainly not an
intentional deviation.

> >The result is Unicode.
> IMHO the result of chr(n) should just be a character.

We call that a Unicode character in Perl. It is true that Perl allows
ordinal values outside the currently existing range, but it is still
called Unicode by Perl's documentation.

> OK, sure.  I was using utf8 to mean any character value you like, and
> UTF-8 to imply a value which is recognised in UCS -- rather than the
> encoding.

Please use utf8 only for naming the byte encoding that allows any
character value you like, not for the ordinal values themselves.
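
To make the naming concrete, a hedged sketch with the Encode module: "utf8"
and "UTF-8" are two different encodings there, the first lax and the second
strict. The lone surrogate is just one convenient test value, and the exact
bytes the strict encoder produces for it may depend on your Encode version,
so the example prints them rather than promising a result.

```perl
use strict;
use warnings;
no warnings 'utf8';    # keep quiet about the surrogate used below
use Encode qw(encode find_encoding);

# Two names, two different encoding objects.
my $lax_name    = find_encoding('utf8')->name;
my $strict_name = find_encoding('UTF-8')->name;
print "utf8  resolves to: $lax_name\n";
print "UTF-8 resolves to: $strict_name\n";

# A code point that UTF-8 proper does not allow: a lone surrogate.
my $char = chr(0xD800);

my $lax    = encode('utf8',  $char);   # lax: encodes whatever ordinal it is given
my $strict = encode('UTF-8', $char);   # strict: follows the standard's restrictions

printf "utf8  bytes: %s\n", unpack('H*', $lax);
printf "UTF-8 bytes: %s\n", unpack('H*', $strict);
```

Either way these are names for byte encodings. The string in $char itself is
just a character with ordinal 0xD800.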

> FWIW I note that printf "%vX" is suggested as a means to render IPv6
> addresses.  This implies the use of a string containing eight characters
> 0..0xFFFF as the packed form of IPv6.  Building one of those using
> chr(n) will generate spurious warnings about 0xFFFE and 0xFFFF !

Interesting point.
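
For reference, a sketch of the usage Chris describes, assuming the "%*vX"
form with a custom join string as documented for sprintf. The address is an
arbitrary documentation-range example; a group of 0xFFFE or 0xFFFF would be
built with chr in exactly the same way, and whether that warns depends on the
Perl version.

```perl
use strict;
use warnings;

# Eight 16-bit groups of an IPv6 address, one character per group.
my @groups = (0x2001, 0x0DB8, 0, 0, 0, 0, 0, 1);
my $packed = join '', map { chr($_) } @groups;

# The vector flag prints each character's ordinal; '*' supplies the separator.
my $text = sprintf '%*vX', ':', $packed;

print "$text\n";    # prints 2001:DB8:0:0:0:0:0:1
```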
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <[EMAIL PROTECTED]>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <[EMAIL PROTECTED]>
