Tim Bunce wrote:

> Am I right in thinking that perl's internal utf8 representation
> represents surrogates as a single (4 byte) code point and not as
> two separate code points?

Mmmh.  Right and wrong... as a single code point, yes, since real
UTF-8 doesn't do surrogates, which are only a UTF-16 thing.  4 bytes, no:
a surrogate code point encoded on its own is 3 bytes (a supplementary
code point encoded directly is 4).
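
To make the byte counts concrete, here is a rough sketch.  It uses
Python's codecs purely as an illustration, it says nothing about perl's
internals, and U+10400 is just an arbitrary supplementary-plane code
point picked for the example.

```python
# Illustration only: Python's codecs, not perl's internal representation.
ch = "\U00010400"  # arbitrary supplementary-plane code point (assumption)

# Real UTF-8 encodes the code point directly, in 4 bytes.
utf8 = ch.encode("utf-8")

# UTF-16 needs a surrogate pair for the same code point.
utf16 = ch.encode("utf-16-be")
hi = int.from_bytes(utf16[:2], "big")  # high surrogate
lo = int.from_bytes(utf16[2:], "big")  # low surrogate

# Pushing each surrogate through the UTF-8 bit layout gives 3 bytes
# apiece, 6 bytes in total.  "surrogatepass" is needed because strict
# UTF-8 refuses to encode surrogate code points.
hi_bytes = chr(hi).encode("utf-8", "surrogatepass")
lo_bytes = chr(lo).encode("utf-8", "surrogatepass")

print(len(utf8), utf8.hex())
print(hex(hi), hex(lo))
print(len(hi_bytes), len(lo_bytes), (hi_bytes + lo_bytes).hex())
```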

> This is the form that Oracle call AL32UTF8.

Does this

http://www.unicode.org/reports/tr26/

look like Oracle's older (?) UTF8?

> What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> byte string that used surrogates? Would there be problems?

You would get out the surrogate code points from the sv, not the
supplementary plane code point the surrogate pairs are encoding.
Depends what you do with the data: this might be okay, might not.
Since it's valid UTF-8, nothing should croak perl-side.
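
A small sketch of what "getting out the surrogate code points" means,
again in Python rather than perl, so treat it as an analogy and not as a
description of what perl's decoder does.  The byte string is a
hand-built example (the pair U+D801 U+DC00, 3 bytes each) and
combine_surrogates is a hypothetical helper, not an existing API.

```python
# Illustration only: this is Python's decoder, not perl's.
# Hand-built example bytes: U+D801 U+DC00, each encoded in 3 bytes.
data = bytes.fromhex("eda081edb080")

# "surrogatepass" lets the surrogate code points through as they are.
chars = data.decode("utf-8", "surrogatepass")
code_points = [ord(c) for c in chars]


def combine_surrogates(hi, lo):
    """Hypothetical helper: map a surrogate pair to the code point it encodes."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


supplementary = combine_surrogates(*code_points)

print([hex(cp) for cp in code_points])  # two code points, not one
print(hex(supplementary))               # what the pair was encoding
```

Whether the two-surrogate form is acceptable depends on what the data is
used for; if the supplementary character itself is wanted, the pair has
to be combined along these lines at some point.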

> (For example, a string returned from Oracle when using the UTF8
> character set instead of the newer AL32UTF8 one.)
> 
> Tim.
