You don't say what Perl version you're using.

Unicode on 5.6.x is a bad plan. Avoid it.

5.8.x should have all of the bugs fixed. In practise, it has most.
However, there is one fundamental design decision that's arguably wrong -
any data stored internally as 8 bit (i.e. read from a file handle or
created from XS via SvPV...) is assumed to be ISO-8859-1 if ever it meets
a chr > 255.

This also works in reverse - by default (as of 5.8.1; ignore 5.8.0 for now
and upgrade if you're using it, as many bugs were fixed by 5.8.1)
file handles are assumed to be "8 bit", where "8 bit" is taken
to be interchangeable with ISO-8859-1.
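
A minimal pure-Perl sketch of that assumption (the variable names are illustrative, not from the thread): an 8-bit string meets a character above 255, and Perl upgrades its bytes as if they were ISO-8859-1.

```perl
use strict;
use warnings;

# An 8-bit string: a single byte 0xE9. Perl does not know its encoding.
my $bytes = "\xE9";

# A character > 255, which can only be stored internally as UTF-8.
my $wide = "\x{263A}";

# Concatenation forces the 8-bit string to be upgraded. Perl treats
# byte 0xE9 as ISO-8859-1, i.e. as the character U+00E9.
my $joined = $bytes . $wide;

printf "length=%d first=U+%04X utf8-flag=%s\n",
    length($joined), ord($joined),
    (utf8::is_utf8($joined) ? "on" : "off");
# prints: length=2 first=U+00E9 utf8-flag=on
```

If the 8-bit data was really in some other encoding, the upgrade still treats every byte as an ISO-8859-1 character, which is why the decision is arguably wrong.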


Irrespective of the rightness or the wrongness of this "8 bit means
ISO-8859-1" decision, it's not possible to change it now.

On Tue, Mar 29, 2005 at 09:03:22PM -0800, William Ahern wrote:

> I create a new Perl string using newSVpvn. Now, here's the weird part. If I
> use SvUTF8_on, the string is somehow downgraded back to iso-8859-1 sometime
> between returning and printing anywhere. utf8::is_utf8 says yes, but when
> printed it's downgraded.

This is correct, as follows:

By calling SvUTF8_on you flag to Perl that your buffer contains UTF-8, and
from here on the Perl language will see it as Unicode characters.
It happens that they are stored as UTF-8, but that's an implementation
detail as far as the language side goes. It happens to be an implementation
detail that everyone here is aware of, because we're coding in XS, not
pure Perl.

When Perl comes to output that string, it knows that the filehandle is
8 bit (i.e. ISO-8859-1), and so knows that it must attempt to output
ISO-8859-1 characters. So it does. (Hence the "downgrade" that you're
seeing - it's all the same characters, just in a different encoding.)
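
The effect can be reproduced without XS. This is a sketch under a few assumptions: utf8::decode stands in for SvUTF8_on on a buffer already holding UTF-8, an in-memory filehandle stands in for the real output, and no -C switch, PERL_UNICODE setting or "use open" pragma is changing the default layers.

```perl
use strict;
use warnings;

# UTF-8 encoded bytes for U+00E9, as an XS buffer might hold them.
my $str = "\xC3\xA9";

# Pure-Perl stand-in for SvUTF8_on: validate the bytes, turn the flag on.
utf8::decode($str) or die "not valid UTF-8";

# In-memory filehandle with no layers, so it is treated as 8 bit.
my $out = '';
open(my $fh, '>', \$out) or die "open: $!";
print {$fh} $str;
close($fh);

# One character in, one ISO-8859-1 byte out.
printf "chars=%d bytes-written=%d byte=0x%02X\n",
    length($str), length($out), ord($out);
# prints: chars=1 bytes-written=1 byte=0xE9
```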

> If I *don't* use SvUTF8_on, and utf8::is_utf8 says no, the string is printed
> as-is; that is, as UTF-8. Everything seems to work.

This is correct. Perl is unaware that the sequence of bytes happens to be
valid UTF-8. It could be valid Big 5, or whatever. It's printed out as is.
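
A small sketch of the unflagged case, again using in-memory filehandles as an assumed stand-in for the real output. The same two bytes pass through an 8-bit handle untouched, but a handle carrying the :utf8 layer treats each byte as an ISO-8859-1 character and encodes it again.

```perl
use strict;
use warnings;

# Two bytes that happen to be valid UTF-8 for U+00E9. The UTF-8 flag is off,
# so to Perl this is simply a two-character 8-bit string.
my $bytes = "\xC3\xA9";

# 8-bit handle: the bytes are written as is.
my $plain = '';
open(my $fh8, '>', \$plain) or die "open: $!";
print {$fh8} $bytes;
close($fh8);

# UTF-8 handle: each byte is taken as an ISO-8859-1 character and encoded.
my $double = '';
open(my $fhu, '>', \$double) or die "open: $!";
binmode($fhu, ":utf8");
print {$fhu} $bytes;
close($fhu);

printf "8-bit handle: %s\n", unpack("H*", $plain);    # c3a9
printf "utf8 handle:  %s\n", unpack("H*", $double);   # c383c2a9
```

So leaving the flag off only "works" for as long as the string never meets a UTF-8 filehandle or a character above 255.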

> Now, if I were to convert a Big5 string to UTF-8, everything works as it is
> supposed to using SvUTF8_on. It seems that Perl is for some reason deciding
> to down convert behind my back only with some strings which can cleanly
> downgrade.

Correct.

In 5.8.x you will get a warning if the attempt to downgrade fails.
To stop the downgrade, mark the file handle as utf8 with

    binmode FH, ":utf8";
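
A hedged sketch of both behaviours, with in-memory filehandles and illustrative variable names, using the ":utf8" layer spelling documented for binmode. The first handle is marked as UTF-8, so nothing is downgraded. The second is left as 8 bit and given a character that cannot be downgraded, so Perl warns.

```perl
use strict;
use warnings;

my $e_acute = "\xC3\xA9";             # UTF-8 bytes for U+00E9
utf8::decode($e_acute) or die "not valid UTF-8";
my $smiley = "\x{263A}";              # not representable in ISO-8859-1

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Handle marked as UTF-8: no downgrade is attempted.
my $enc = '';
open(my $fhu, '>', \$enc) or die "open: $!";
binmode($fhu, ":utf8");
print {$fhu} $e_acute;
close($fhu);

# 8-bit handle: the downgrade fails, Perl warns and writes UTF-8 bytes.
my $raw = '';
open(my $fh8, '>', \$raw) or die "open: $!";
print {$fh8} $smiley;
close($fh8);

printf "utf8 handle:  %s\n", unpack("H*", $enc);   # c3a9
printf "8-bit handle: %s\n", unpack("H*", $raw);   # e298ba
print  "warning: $_" for @warnings;                # Wide character in print ...
```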

(5.8.x and later. 5.6.x is only "marketing compatible" with Unicode. One
reason being that there's no correct way to do UTF-8 output.)


> I thought it might have been an output issue. But in fact, I can use
> utf8::encode() on the scalar to encode, and it then prints as UTF-8. Again,
> none of this is needed if I have a UTF-8 string which is unable to downgrade
> to ISO-8859-1. But if I _do_ use utf8::encode on a Big5->UTF-8 conversion
> which has the UTF-8 flag set using SvUTF8_on, it also works (doesn't double
> encode).
> 
> Am I doing something wrong? What is the proper way to create a UTF-8
> string--from UTF-8 source--in perl xs?

You're doing the correct thing to create a Unicode string.

You need to use binmode to flag that the filehandle is expecting UTF-8
output.

Nicholas Clark
