Stas Bekman wrote:
Steve Hay wrote:
Hi,
I've just spent quite a while tracking down a problem with a web page generated by a mod_perl program in which 8-bit ISO-8859-1 characters were not being shown properly. The software runs via Apache::Registry, and works fine under mod_cgi.
It turns out that the problem is due to a difference in behaviour between Perl's built-in print() function in Perl 5.8.0+ and the Apache->print() method that mod_perl overrides it with. I've consulted the documentation on the mod_perl website, and could find no mention of the difference. If my conclusions below are correct then this information may well be worth adding.
[the rest of the very detailed analysis has been snipped]
5.8.0 is a pretty new perl version, which provides the new functionality, and it seems that hardly anybody has been using the UTF stuff with mod_perl.
5.8.0 is actually a couple of days short of being one year old (happy birthday!), which is increasingly not that new any more. 5.8.1 should be out soon too.
As for hardly anybody using UTF8 stuff with mod_perl... I didn't think that I was until I realised that most XML parsers (certainly the two that I use most -- XML::LibXML and XML::DOM) return all their data in Perl's internal UTF-8 format! Then the penny dropped that I was actually using it rather a lot :-)
So I suppose you are the first one to hit the problem. Certainly we need to update mod_perl to handle this correctly. Would you be interested to try to make Apache->print() do the right thing?
Hmm. We really need somebody who understands the internals of Perl and mod_perl better than me, but here's a first stab at it:
The Perl source code contains a pp_print() function in "pp_hot.c" which I presume is basically CORE::print(). It makes use of a do_print() function. I think that function comes from "doio.c", although it's actually called Perl_do_print() there. That function does some stuff with the UTF-8 flag, which I guess is the sort of thing that we're after. Here's a chunk of Perl_do_print() from Perl 5.8.0:
    if (PerlIO_isutf8(fp)) {
        if (!SvUTF8(sv))
            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
    }
    else if (DO_UTF8(sv)) {
        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
            && ckWARN_d(WARN_UTF8))
        {
            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
        }
    }
I think what this does is check whether the fp (a PerlIO *) has the ":utf8" layer. If so, then it upgrades a copy of the sv to UTF8 (which is always possible). If not, then DO_UTF8() checks whether the sv is flagged as UTF8 and the "bytes" pragma is not in effect. In that case it tries to downgrade a copy of the sv from UTF8 (which is not always possible -- if that fails and the UTF8 warnings category is enabled then it outputs the good ol' "Wide character in print" warning).
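To make the downgrade step concrete, here is a minimal standalone sketch in C of what "downgrading from UTF8" amounts to for ISO-8859-1 data. It is not Perl's sv_utf8_downgrade() and uses none of the Perl API. The function name utf8_to_latin1() and its return convention are assumptions made purely for illustration.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Perl or mod_perl): convert the UTF-8
 * bytes in `in` (length `len`) to ISO-8859-1 bytes in `out`, which must
 * have room for at least `len` bytes.
 * Returns the number of bytes written, or -1 if the input contains a
 * character above U+00FF or a malformed sequence. The -1 case corresponds
 * to the situation in which print() would warn "Wide character in print". */
int utf8_to_latin1(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i = 0;
    int n = 0;

    while (i < len) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* Plain ASCII: copy through unchanged. */
            out[n++] = c;
            i += 1;
        }
        else if ((c == 0xC2 || c == 0xC3) && i + 1 < len
                 && (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF:
             * 110000xx 10yyyyyy -> xxyyyyyy */
            out[n++] = (unsigned char)(((c & 0x1F) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        }
        else {
            /* Anything else does not fit in a single Latin-1 byte. */
            return -1;
        }
    }
    return n;
}
```

Called on the two bytes 0xC3 0xBC (octal 303 274) this returns 1 and stores 0xFC (octal 374). Called on 0xE2 0x82 0xAC (the euro sign, U+20AC) it returns -1, since that character has no ISO-8859-1 representation.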
I have attempted to shoe-horn this into mod_perl's print() method (in "src/modules/perl/Apache.xs"). Unfortunately, I've had to comment out the first part of that "if" block, because otherwise I got an unresolved external symbol error relating to the PerlIO_isutf8() function (which may be because that function isn't documented in the perlapio manpage). Here's the diff against mod_perl 1.28:
--- Apache.xs.orig 2003-06-06 12:31:10.000000000 +0100
+++ Apache.xs 2003-07-15 12:20:42.000000000 +0100
@@ -1119,12 +1119,25 @@
     SV *sv = sv_newmortal();
     SV *rp = ST(0);
     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
+    /*PerlIO *fp = PerlIO_stdout();*/
 
     if(items > 2)
         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
     else
         sv_setsv(sv, ST(1));
 
+    /*if (PerlIO_isutf8(fp)) {
+        if (!SvUTF8(sv))
+            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
+    }
+    else*/ if (DO_UTF8(sv)) {
+        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
+            && ckWARN_d(WARN_UTF8))
+        {
+            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
+        }
+    }
+
     PUSHMARK(sp);
     XPUSHs(rp);
     XPUSHs(sv);
Besides the problem with PerlIO_isutf8(), there are other problems that spring to my mind straight away with this:
- is getting the PerlIO * for STDOUT the right thing to be doing anyway?
- if "items > 2", do we need to handle the UTF8-ness of each of those items individually before we join them?
- we need to code this in such a way as to remain backwards compatible with older Perls.
Anyway, it's a start.
If not, we should log it in the STATUS file and hopefully someone will have the time and inclination to solve it.
Hopefully the above stab at it will encourage somebody to have a serious look.
In any case a simple test that reproduces the problem will be needed.
This test program reproduces the problem:
    use 5.008;
    use Encode;
    print "Content-type: text/plain\n\n", decode('iso-8859-1', 'ü');
Use LWP's "get" program to get that from an Apache/mod_cgi setup, run it through UNIX's "od -c" (get http://localhost/cgi-bin/test.pl | od -c) and you get:
    0000000 374
    0000001
Try the same from an Apache/mod_perl setup and you get:
    0000000 303 274
    0000002
i.e. Perl's print() converts the two-byte UTF-8 sequence representing ü back to the single byte ü [ü is character 252, octal 374], whereas Apache's print() outputs the two bytes unchanged.
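The relationship between the two "od -c" outputs can be checked by hand. The following is a small standalone sketch in C, assuming a hypothetical helper named latin1_to_utf8() (not a Perl or mod_perl function), that encodes a single ISO-8859-1 character as UTF-8:

```c
#include <stddef.h>

/* Hypothetical helper for illustration only: encode the ISO-8859-1
 * character `cp` (0..255) as UTF-8 into `out`.
 * Returns the number of bytes written (1 or 2). */
size_t latin1_to_utf8(unsigned char cp, unsigned char out[2])
{
    if (cp < 0x80) {
        /* ASCII is encoded as itself. */
        out[0] = cp;
        return 1;
    }
    /* U+0080..U+00FF: the top two bits go into the lead byte (110000xx),
     * the low six bits into the continuation byte (10yyyyyy). */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
}
```

For ü (252 decimal, 0xFC, octal 374) the lead byte is 0xC0 | 3 = 0xC3 (octal 303) and the continuation byte is 0x80 | 0x3C = 0xBC (octal 274), which are the two bytes seen in the mod_perl output.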
I've actually re-built my mod_perl using the half-formed patch given above and it fixes this particular test case!
Steve