Re: utf8::valid and \x14_000 - \x1F_0000

Juerd Waalboer Tue, 11 Mar 2008 14:26:16 -0700

Chris Hall skribis 2008-03-11 21:09 (+0000):
> OK.  In the meantime IMHO chr(n) should be handling utf8 and has no 
> business worrying about things which UTF-8 or UCS think aren't 
> characters.


It should do Unicode, not any specific byte encoding, like UTF-?8.

Internally, a byte encoding is needed. As a programmer I don't want to
be bothered with such implementation details.

> Note that chr(n) is whingeing about 0xFFFE, which Encode::en/decode
> (UTF-8) are happy with.  Unicode defines 0xFFFE and 0xFFFF as 
> non-characters, not just 0xFFFF (which Encode::en/decode do deem 
> invalid).

Personally, I think Perl should accept these characters without warning,
except when the strict UTF-8 encoding is requested (which differs from the
non-strict UTF8 encoding).
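
A rough sketch of the strict/non-strict difference, assuming the core
Encode module. It uses 0x110000 (one past the last Unicode code point)
instead of the non-characters discussed above, because the handling of
U+FFFE and U+FFFF may depend on the Encode version. The helper name
try_encode is made up for this example:

```perl
use strict;
use warnings;
no warnings 'utf8';    # keep quiet about the out-of-range code point
use Encode qw(encode);

# Hypothetical helper: returns the encoded bytes, or undef if encode() died.
sub try_encode {
    my ($encoding, $string) = @_;    # $string is a private copy
    return eval { encode($encoding, $string, Encode::FB_CROAK) };
}

# 0x110000 lies just outside the Unicode range (the last code point is U+10FFFF).
my $beyond = chr(0x110000);

my $lax    = try_encode('utf8',  $beyond);    # non-strict encoding
my $strict = try_encode('UTF-8', $beyond);    # strict encoding

printf "utf8:  %s\n", defined $lax    ? 'accepted' : 'rejected';
printf "UTF-8: %s\n", defined $strict ? 'accepted' : 'rejected';
```

According to the Encode documentation, the non-strict encoding should
accept such a value, while the strict one croaks when FB_CROAK is given.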

> >>In any case, is chr(n) supposed to be utf8 or UTF-8 ?  AFAIKS, it's
> >>neither.
> >It's supposed to be neither on the outside. Internally, it's utf8.
> One can turn off the warnings and then chr(n) will happily take any +ve 
> integer and give you the equivalent character -- so the result is utf8, 

The result is Unicode. The difference between Unicode and UTF8 is not
always clear, but in this case it is: the character is Unicode, a single
codepoint; the internal implementation is UTF8.

Unicode: U+20AC    (one character: €)
UTF-8:   E2 82 AC  (three bytes)
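
The same distinction as a minimal Perl sketch, assuming the core Encode
module; the variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One Unicode character: the euro sign, code point U+20AC.
my $char = chr(0x20AC);

# Encoding turns the character string into a byte string.
my $bytes = encode('UTF-8', $char);

# Hex dump of the encoded bytes, one pair of digits per byte.
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;

printf "characters: %d\n", length($char);            # 1
printf "bytes:      %d (%s)\n", length($bytes), $hex; # 3 (E2 82 AC)
```

length() counts characters on $char and bytes on $bytes, because
encode() returns a string of bytes.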

I am under the impression that you know the difference and made an
honest mistake. My detailed expansion is also for lurkers and archives.

> [replacement character]
> So we'll have to differ on this :-)

Yes, although my opinion on this is not strong. undef or replacement
character - both are good options. One argument in favor of the
replacement character would be backwards compatibility.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <[EMAIL PROTECTED]>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <[EMAIL PROTECTED]>
