On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
: Tim Bunce wrote:
:
: > Am I right in thinking that perl's internal utf8 representation
: > represents surrogates as a single (4 byte) code point and not as
: > two separate code points?
:
: Mmmh. Right and wrong... as a single code point, yes, since the real
: UTF-8 doesn't do surrogates which are only a UTF-16 thing. 4 bytes, no,
: 3 bytes.

No, Tim's right--they're four bytes. It's only the individual surrogates
that would come out to three bytes. The break between three and four
bytes is between \x{ffff} and \x{10000}.

Larry
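
For anyone who wants to check the byte counts, here is a minimal sketch. It uses Python's UTF-8 encoder rather than perl's internals, so it only illustrates the encoding lengths, not how perl stores strings. The helper names utf8_len and combine_surrogates are made up for this example, and the "surrogatepass" error handler is used only so that a lone surrogate can be pushed through the encoder at all.

```python
def utf8_len(cp: int) -> int:
    """Return how many bytes UTF-8 needs for the code point cp."""
    # "surrogatepass" lets lone surrogates (U+D800..U+DFFF) be encoded;
    # the default strict handler would raise UnicodeEncodeError for them.
    return len(chr(cp).encode("utf-8", "surrogatepass"))


def combine_surrogates(hi: int, lo: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


# The three/four byte break sits between U+FFFF and U+10000.
len_ffff = utf8_len(0xFFFF)
len_10000 = utf8_len(0x10000)

# An individual surrogate falls below U+FFFF, so it takes three bytes.
len_lone_surrogate = utf8_len(0xD800)

# The code point a surrogate pair stands for is U+10000 or above,
# so as a single code point it takes four bytes.
combined = combine_surrogates(0xD800, 0xDC00)
len_combined = utf8_len(combined)

print(len_ffff, len_10000, len_lone_surrogate, hex(combined), len_combined)
```

Every surrogate pair combines to a code point at or above \x{10000}, which is why the single-code-point form is always four bytes.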