Re: AL32UTF8

Larry Wall Fri, 30 Apr 2004 02:19:21 -0700

On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
: Tim Bunce wrote:
: 
: > Am I right in thinking that perl's internal utf8 representation
: > represents surrogates as a single (4 byte) code point and not as
: > two separate code points?
: 
: Mmmh.  Right and wrong... as a single code point, yes, since the real
: UTF-8 doesn't do surrogates which are only a UTF-16 thing.  4 bytes, no,
: 3 bytes.


No, Tim's right--they're four bytes.  It's only the individual
surrogates that would come out to three bytes.  The break between
three and four bytes is between \x{ffff} and \x{10000}.
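
As a rough illustration of that boundary (a sketch in Python rather than
perl, so it says nothing about perl's internals; the helper names utf8_len
and surrogate_pair are made up for this example):

```python
def utf8_len(cp: int) -> int:
    """Return how many bytes UTF-8 uses for one code point."""
    # "surrogatepass" lets a lone surrogate (U+D800..U+DFFF) be encoded
    # as if it were an ordinary code point, instead of raising an error.
    return len(chr(cp).encode("utf-8", errors="surrogatepass"))


def surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have a surrogate pair")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    # Either side of the three/four byte break.
    for cp in (0xFFFF, 0x10000):
        print(f"U+{cp:04X}: {utf8_len(cp)} bytes")

    # The same character seen as two individual surrogates.
    for half in surrogate_pair(0x10000):
        print(f"U+{half:04X}: {utf8_len(half)} bytes")
```

Encoded as a single code point, \x{10000} takes four bytes; each half of
its surrogate pair, encoded on its own, takes three.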

Larry
