Re: Undocumented behaviour in Apache->print()?
Steve Hay wrote:

> Stas Bekman wrote:
>
>>> I have attempted to shoe-horn this into mod_perl's print() method (in src/modules/perl/Apache.xs). Here's the diff against mod_perl 1.28: [Unfortunately, I've had to comment out the first part of that if block, because I got an "unresolved external symbol" error relating to the PerlIO_isutf8() function otherwise (which may be because that function isn't documented in the perlapio manpage).]
>>
>> mod_perl 1.x doesn't use perlio, hence you have this problem. Adding an #include of perlio.h should resolve it, I think.
>
> No. The error was "unresolved external symbol", which means that the compiler is happy (it evidently has pulled in perlio.h, or something else that declares PerlIO_isutf8() as "extern ..."), but that the linker couldn't find the definition of that function. (Check: if I change PerlIO_isutf8 to PerlIO_isutf (deliberate typo) then I get a different error, "undefined; assuming extern returning int", because now no declaration has been supplied.)
>
> Listing the symbols exported from perl58.lib shows that PerlIO_isutf8 is *not* one of them. So where's the definition supposed to come from? I'll ask about this on the perlxs mailing list, I think.

Having asked about this, it turns out that the problem was PerlIO_isutf8() not being exported from perl58.lib on Windows (and other platforms where the symbols to export have to be explicitly listed).

I sent a patch off to p5p which fixes this (http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-07/msg01096.html), and it has been integrated as #20203.

So presumably this will not be a problem in the upcoming perl-5.8.1, but how do we cope with the perl-5.8.0 case?

I've attached a new patch (against mod_perl-1.28) which (I believe) fixes the UTF-8 problem, but it won't build on Windows with perl-5.8.0 without #20203.
Steve

--- Apache.xs.orig	2003-06-06 12:31:10.000000000 +0100
+++ Apache.xs	2003-07-18 08:47:59.000000000 +0100
@@ -1119,11 +1119,27 @@
     SV *sv = sv_newmortal();
     SV *rp = ST(0);
     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
+#if PERL_VERSION >= 8
+    PerlIO *fp = IoOFP(GvIOp(defoutgv));
+#endif
 
     if(items > 2)
         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
     else
         sv_setsv(sv, ST(1));
 
+#if PERL_VERSION >= 8
+    if (PerlIO_isutf8(fp)) {
+        if (!SvUTF8(sv))
+            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
+    }
+    else if (DO_UTF8(sv)) {
+        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
+            && ckWARN_d(WARN_UTF8))
+        {
+            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
+        }
+    }
+#endif
     PUSHMARK(sp);
     XPUSHs(rp);
@@ -1176,6 +1192,20 @@
     int sent = 0;
     SV *sv = SvROK(ST(i)) && (SvTYPE(SvRV(ST(i))) == SVt_PV) ?
              (SV*)SvRV(ST(i)) : ST(i);
+#if PERL_VERSION >= 8
+    PerlIO *fp = IoOFP(GvIOp(defoutgv));
+    if (PerlIO_isutf8(fp)) {
+        if (!SvUTF8(sv))
+            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
+    }
+    else if (DO_UTF8(sv)) {
+        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
+            && ckWARN_d(WARN_UTF8))
+        {
+            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
+        }
+    }
+#endif
     buffer = SvPV(sv, len);
 #ifdef APACHE_SSL
     while(len > 0) {
Re: Undocumented behaviour in Apache->print()?
Stas Bekman wrote: Steve Hay wrote: It's only Perl 5.8 that has the special UTF-8 flag which the functions above all operate with respect to. If a Perl variable contains a sequence of bytes that make up a valid UTF-8 character, but the string is not flagged with Perl's special flag, then Perl's built-in print() doesn't do this automatic conversion anyway. Yes. Apps wanting to handle utf will need to 'require 5.008;' as in your example. IOW, print Content-type: text/plain\n\n; $a = \xC3\xBC; print $a; retrieved from a mod_cgi server produces (via od -b / od -c): 000 303 274 002 yup, because you need to add utf8::decode($a); before printing $a. Which your version does as well. (Indeed. I meant it as example of how Perl's (5.8's) print() doesn't do the conversion on strings that are not *flagged* as UTF-8, even when they make valid UTF-8.) Perl 5.6 and older don't have the UTF-8 flag and hence don't do any automatic conversion via print(). Therefore, mod_perl's print() should not have the difference from Perl's print() that exists in 5.8, so no change should be required. Sure enough, looking at the doio.c source file in Perl 5.6.1, the entire chunk of code that I half-inched above is not present. So you suggest that we copy this functionality from Perl. So if need to #ifdef it for 5.8.0. So I'll add #if PERL_VERSION = 8 ... #endif around the code that I've added. I have attempted to shoe-horn this into mod_perl's print() method (in src/modules/perl/Apache.xs). Here's the diff against mod_perl 1.28: [Unfortunately, I've had to comment-out the first part of that if block, because I got an unresolved external symbol error relating to the PerlIO_isutf8() function otherwise (which may be because that function isn't documented in the perlapio manpage).] 
--- Apache.xs.orig2003-06-06 12:31:10.0 +0100 +++ Apache.xs2003-07-15 12:20:42.0 +0100 @@ -1119,12 +1119,25 @@ SV *sv = sv_newmortal(); SV *rp = ST(0); SV *sendh = perl_get_sv(Apache::__SendHeader, TRUE); +/*PerlIO *fp = PerlIO_stdout();*/ if(items 2) do_join(sv, sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */ else sv_setsv(sv, ST(1)); +/*if (PerlIO_isutf8(fp)) { +if (!SvUTF8(sv)) +sv_utf8_upgrade(sv = sv_mortalcopy(sv)); +} +else*/ if (DO_UTF8(sv)) { +if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE) + ckWARN_d(WARN_UTF8)) +{ +Perl_warner(aTHX_ packWARN(WARN_UTF8), Wide character in print); +} +} + PUSHMARK(sp); XPUSHs(rp); XPUSHs(sv); Besides the problem with PerlIO_isutf8(), mod_perl 1.x doesn't use perlio, hence you have this problem. adding: #include perlio.h should resolve it I think. No. The error was unresolved external symbol, which means that the compiler is happy (it evidently has pulled in perlio.h, or something else that declares PerlIO_isutf8() as extern ...), but that the linker couldn't find the definition of that function. (Check: If I change PerlIO_isutf8 to PerlIO_isutf (deliberate typo) then I get a different error - undefined; assuming extern returning int - because now no declaration has been supplied.) Listing the symbols exported from perl58.lib shows that PerlIO_isutf8 is *not* one of them. So where's the definition supposed to come from? I'll ask about this on the perlxs mailing list, I think. there are other problems that spring to my mind straight away with this: - is getting the PerlIO * for STDOUT to right thing to be doing anyway? PerlIO *fp = IoOFP(GvIOp(defoutgv)) Seems to work OK for me. What's defoutgv? - if items 2, do we need to handle the UTF8-ness of each of those items individually before we join them? I'm not sure, how perl handles this? 
Struggling as best as I can to read pp_print() in Perl's pp_hot.c, it looks like Perl calls do_print() (which contains the UTF-8 handling that I've stolen) for each item in the list that is passed to it.

Considering this more, I think that it probably isn't an issue: if you have two variables in Perl, one of which is flagged UTF-8 and the other of which isn't, then when you concatenate them, the whole is upgraded to flagged UTF-8 anyway.

However, it has occurred to me that I've missed out adding the UTF-8 handling to half of mod_perl's print() method! The method is split into two halves:

    if (!mod_perl_sent_header(r, 0)) { ... } else { ... }

and I've only handled the first half!

The first half joins all of the items together and then calls send_cgi_header(). That outputs everything down to the first blank line (i.e. all the headers), then sets the "sent headers" flag and recurses on $r->print(). Next time around, we'll enter the second half, which simply calls write_client().

If we've already been through the first half then the UTF-8 conversion will have been applied already,
Re: Undocumented behaviour in Apache->print()?
Steve Hay wrote:

> Hi,
>
> I've just spent quite a while tracking down a problem with a web page generated by a mod_perl program in which 8-bit ISO-8859-1 characters were not being shown properly. The software runs via Apache::Registry, and works fine under mod_cgi.
>
> It turns out that the problem is due to a difference in behaviour between Perl's built-in print() function in Perl 5.8.0+ and the Apache->print() method that mod_perl overrides it with.
>
> I've consulted the documentation on the mod_perl website, and could find no mention of the difference. If my conclusions below are correct then this information may well be worth adding.

[the rest of the very detailed analysis has been snipped]

5.8.0 is a pretty new perl version, which provides the new functionality, and it seems that hardly anybody has been using the UTF stuff with mod_perl. So I suppose you are the first one to hit the problem. Certainly we need to update mod_perl to handle this correctly. Would you be interested to try to make Apache->print() do the right thing? If not, we should log it in the STATUS file and hopefully someone will have the time and inclination to solve it. In any case a simple test that reproduces the problem will be needed.

__
Stas Bekman  JAm_pH -- Just Another mod_perl Hacker
http://stason.org/  mod_perl Guide --- http://perl.apache.org
mailto:[EMAIL PROTECTED]  http://use.perl.org  http://apacheweek.com
http://modperlbook.org  http://apache.org  http://ticketmaster.com
Re: Undocumented behaviour in Apache->print()?
Hi Stas,

Stas Bekman wrote:

> Steve Hay wrote:
>
>> Hi,
>>
>> I've just spent quite a while tracking down a problem with a web page generated by a mod_perl program in which 8-bit ISO-8859-1 characters were not being shown properly. The software runs via Apache::Registry, and works fine under mod_cgi.
>>
>> It turns out that the problem is due to a difference in behaviour between Perl's built-in print() function in Perl 5.8.0+ and the Apache->print() method that mod_perl overrides it with.
>>
>> I've consulted the documentation on the mod_perl website, and could find no mention of the difference. If my conclusions below are correct then this information may well be worth adding.
>
> [the rest of the very detailed analysis has been snipped]
>
> 5.8.0 is a pretty new perl version, which provides the new functionality, and it seems that hardly anybody has been using the UTF stuff with mod_perl.

5.8.0 is actually a couple of days short of being one year old (happy birthday!), which is increasingly not that new any more. 5.8.1 should be out soon too.

As for hardly anybody using UTF8 stuff with mod_perl... I didn't think that I was until I realised that most XML parsers (certainly the two that I use most -- XML::LibXML and XML::DOM) return all their data in Perl's internal UTF-8 format! Then the penny dropped that I was actually using it rather a lot :-)

> So I suppose you are the first one to hit the problem. Certainly we need to update mod_perl to handle this correctly. Would you be interested to try to make Apache->print() do the right thing?

Hmm. We really need somebody who understands the internals of Perl and mod_perl better than me, but here's a first stab at it:

The Perl source code contains a pp_print() function in pp_hot.c which I presume is basically CORE::print(). It makes use of a do_print() function. I think that function comes from doio.c, although it's actually called Perl_do_print() there.
That function does some stuff with the UTF-8 flag, which I guess is the sort of thing that we're after. Here's a chunk of Perl_do_print() from Perl 5.8.0:

    if (PerlIO_isutf8(fp)) {
        if (!SvUTF8(sv))
            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
    }
    else if (DO_UTF8(sv)) {
        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
            && ckWARN_d(WARN_UTF8))
        {
            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
        }
    }

I think what this does is look to see if the fp (a PerlIO *) has the :utf8 encoding layer. If so, then it upgrades the sv to UTF8 (which is always possible). If not, then it looks to see if the bytes pragma is enabled. If not, then it downgrades the sv from UTF8 (which is not always possible -- if that fails and the UTF8 warnings category is enabled then it outputs the good ol' "Wide character in print" warning).

I have attempted to shoe-horn this into mod_perl's print() method (in src/modules/perl/Apache.xs). Here's the diff against mod_perl 1.28:

[Unfortunately, I've had to comment out the first part of that if block, because I got an "unresolved external symbol" error relating to the PerlIO_isutf8() function otherwise (which may be because that function isn't documented in the perlapio manpage).]
--- Apache.xs.orig	2003-06-06 12:31:10.000000000 +0100
+++ Apache.xs	2003-07-15 12:20:42.000000000 +0100
@@ -1119,12 +1119,25 @@
     SV *sv = sv_newmortal();
     SV *rp = ST(0);
     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
+/*    PerlIO *fp = PerlIO_stdout();*/
 
     if(items > 2)
         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
     else
         sv_setsv(sv, ST(1));
 
+/*    if (PerlIO_isutf8(fp)) {
+        if (!SvUTF8(sv))
+            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
+    }
+    else*/ if (DO_UTF8(sv)) {
+        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
+            && ckWARN_d(WARN_UTF8))
+        {
+            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
+        }
+    }
+
     PUSHMARK(sp);
     XPUSHs(rp);
     XPUSHs(sv);

Besides the problem with PerlIO_isutf8(), there are other problems that spring to my mind straight away with this:

- is getting the PerlIO * for STDOUT the right thing to be doing anyway?
- if items > 2, do we need to handle the UTF8-ness of each of those items individually before we join them?
- we need to code this in such a way as to remain backwards compatible with older Perls.

Anyway, it's a start.

> If not, we should log it in the STATUS file and hopefully someone will have the time and inclination to solve it.

Hopefully the above stab at it will encourage somebody to have a serious look.

> In any case a simple test that reproduces the problem will be needed.

This test program reproduces the problem:

    use 5.008;
    use Encode;
    print "Content-type: text/plain\n\n", decode('iso-8859-1', 'ü');

Use LWP's get program to get that from an Apache/mod_cgi setup, run it through UNIX's od -c (get http://localhost/cgi-bin/test.pl | od
Re: Undocumented behaviour in Apache->print()?
Steve Hay wrote:

>> 5.8.0 is a pretty new perl version, which provides the new functionality, and it seems that hardly anybody has been using the UTF stuff with mod_perl.
>
> 5.8.0 is actually a couple of days short of being one year old (happy birthday!), which is increasingly not that new any more. 5.8.1 should be out soon too.

I meant that it was too new to be embraced by the crowd. It'll probably take a few more years before this will happen. In any case, this is just an excuse ;)

> As for hardly anybody using UTF8 stuff with mod_perl... I didn't think that I was until I realised that most XML parsers (certainly the two that I use most -- XML::LibXML and XML::DOM) return all their data in Perl's internal UTF-8 format! Then the penny dropped that I was actually using it rather a lot :-)

I thought XML was dead. Do people still use this archaic technology? I went to this session at this OS conference with many k00l ppls and there was this dude[1] who said that YAML is the future. Next they started talking about animals, and for some reason everybody liked ponie. All well, orange people [2], orange sites [3], orange ponies [4], jetlag, too many flights, too little sleep...

1: http://husk.org/pics/imgs/people/perl/london.pm_ingy_2001-07-30/ingy_nino_tired.jpg
2: http://husk.org/pics/imgs/people/perl/london.pm_ingy_2001-07-30/acme_perl_hacker_scary.jpg
3: http://search.cpan.org/
4: http://ponie.kwiki.org/ http://www.poniecode.org/

;)

__
Stas Bekman
Re: Undocumented behaviour in Apache->print()?
[putting the test case on the top]

Steve Hay wrote:

>> In any case a simple test that reproduces the problem will be needed.
>
> This test program reproduces the problem:
>
>     use 5.008;
>     use Encode;
>     print "Content-type: text/plain\n\n", decode('iso-8859-1', 'ü');
>
> Use LWP's get program to get that from an Apache/mod_cgi setup, run it through UNIX's od -c (get http://localhost/cgi-bin/test.pl | od -c) and you get:
>
>     0000000 374
>     0000001
>
> Try the same from an Apache/mod_perl setup and you get:
>
>     0000000 303 274
>     0000002
>
> i.e. the double-byte UTF-8 character representing ü that has been output is converted back to ü by Perl's print() [ü is character 252, octal 374], but is left as the two bytes by Apache's print().
>
> I've actually re-built my mod_perl using the half-formed patch given above and it fixes this particular test case!

On my linux box it's 'od -b'; 'od -c' prints the actual ascii chars.

I've tested mp2 and it has the same problem. I've used a different version of your test:

    #!/usr/bin/perl -w
    use utf8;
    print "Content-type: text/plain\n\n";
    $a = "\xC3\xBC";
    utf8::decode($a);
    print $a;

which gives the same char, as in:

    % perl -le '$a = "\xC3\xBC"; use utf8; utf8::decode($a); print $a;'
    ü

mod_perl 1.0 and 2.0 respond with:

    GET 'http://localhost:8002/cgi-bin/test.pl' | od -b
    0000000 303 274

and mod_cgi with:

    0000000 374

> Hmm. We really need somebody who understands the internals of Perl and mod_perl better than me, but here's a first stab at it:
>
> The Perl source code contains a pp_print() function in pp_hot.c which I presume is basically CORE::print(). It makes use of a do_print() function. I think that function comes from doio.c, although it's actually called Perl_do_print() there. That function does some stuff with the UTF-8 flag, which I guess is the sort of thing that we're after.
Here's a chunk of Perl_do_print() from Perl 5.8.0: if (PerlIO_isutf8(fp)) { if (!SvUTF8(sv)) sv_utf8_upgrade(sv = sv_mortalcopy(sv)); } else if (DO_UTF8(sv)) { if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE) ckWARN_d(WARN_UTF8)) { Perl_warner(aTHX_ packWARN(WARN_UTF8), Wide character in print); } } I think what this does is look to see if the fp (a PerlIO *) has the :utf8 encoding layer. If so, then it upgrades the sv to UTF8 (which is always possible). If not, then it looks to see if the bytes pragma is enabled. If not, then it downgrades the sv from UTF8 (which is not always possible -- if that fails and the UTF8 warnings category is enabled then it outputs the good ol' Wide character in print warning). I have attempted to shoe-horn this into mod_perl's print() method (in src/modules/perl/Apache.xs). Here's the diff against mod_perl 1.28: [Unfortunately, I've had to comment-out the first part of that if block, because I got an unresolved external symbol error relating to the PerlIO_isutf8() function otherwise (which may be because that function isn't documented in the perlapio manpage).] --- Apache.xs.orig2003-06-06 12:31:10.0 +0100 +++ Apache.xs2003-07-15 12:20:42.0 +0100 @@ -1119,12 +1119,25 @@ SV *sv = sv_newmortal(); SV *rp = ST(0); SV *sendh = perl_get_sv(Apache::__SendHeader, TRUE); +/*PerlIO *fp = PerlIO_stdout();*/ if(items 2) do_join(sv, sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */ else sv_setsv(sv, ST(1)); +/*if (PerlIO_isutf8(fp)) { +if (!SvUTF8(sv)) +sv_utf8_upgrade(sv = sv_mortalcopy(sv)); +} +else*/ if (DO_UTF8(sv)) { +if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE) + ckWARN_d(WARN_UTF8)) +{ +Perl_warner(aTHX_ packWARN(WARN_UTF8), Wide character in print); +} +} + PUSHMARK(sp); XPUSHs(rp); XPUSHs(sv); Besides the problem with PerlIO_isutf8(), there are other problems that spring to my mind straight away with this: - is getting the PerlIO * for STDOUT to right thing to be doing anyway? 
> - if items > 2, do we need to handle the UTF8-ness of each of those items individually before we join them?
> - we need to code this in such a way as to remain backwards compatible with older Perls.

Looks like this is the main question. Do we handle utf8 only for perl 5.8?

__
Stas Bekman
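For anyone who wants to experiment without rebuilding mod_perl, the branch structure of the Perl_do_print() chunk quoted in this thread can be approximated in pure Perl. This is only a sketch under stated assumptions: bytes_for_print() and octal_dump() are hypothetical helper names (not part of Perl or mod_perl), the caller says whether the handle has a :utf8 layer rather than the code asking PerlIO, and the bytes pragma check that DO_UTF8() performs is ignored.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl or mod_perl API): returns the octets that
# a 5.8-style print() would be expected to write for $str.
sub bytes_for_print {
    my ($str, $handle_is_utf8) = @_;
    my $copy = $str;                      # work on a copy, like sv_mortalcopy()

    if ($handle_is_utf8) {
        utf8::encode($copy);              # characters -> UTF-8 octets
    }
    elsif (utf8::is_utf8($copy)) {
        # Second argument true = FAIL_OK: return false instead of dying.
        unless (utf8::downgrade($copy, 1)) {
            warn "Wide character in print\n";
            utf8::encode($copy);          # fall back to the raw UTF-8 octets
        }
    }
    return $copy;
}

# Hypothetical helper: show octets roughly the way "od -b" does.
sub octal_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%03o', ord($_) } split //, $octets;
}

my $u_umlaut = "\xFC";
utf8::upgrade($u_umlaut);                 # flagged, as decode() would return it

print octal_dump(bytes_for_print($u_umlaut, 0)), "\n";   # 374
print octal_dump(bytes_for_print($u_umlaut, 1)), "\n";   # 303 274
```

The first line of output corresponds to what mod_cgi was reported to send in the test above, the second to what mod_perl was reported to send.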
Re: Undocumented behaviour in Apache->print()?
Stas Bekman wrote: I have attempted to shoe-horn this into mod_perl's print() method (in src/modules/perl/Apache.xs). Here's the diff against mod_perl 1.28: [Unfortunately, I've had to comment-out the first part of that if block, because I got an unresolved external symbol error relating to the PerlIO_isutf8() function otherwise (which may be because that function isn't documented in the perlapio manpage).] --- Apache.xs.orig2003-06-06 12:31:10.0 +0100 +++ Apache.xs2003-07-15 12:20:42.0 +0100 @@ -1119,12 +1119,25 @@ SV *sv = sv_newmortal(); SV *rp = ST(0); SV *sendh = perl_get_sv(Apache::__SendHeader, TRUE); +/*PerlIO *fp = PerlIO_stdout();*/ if(items 2) do_join(sv, sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */ else sv_setsv(sv, ST(1)); +/*if (PerlIO_isutf8(fp)) { +if (!SvUTF8(sv)) +sv_utf8_upgrade(sv = sv_mortalcopy(sv)); +} +else*/ if (DO_UTF8(sv)) { +if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE) + ckWARN_d(WARN_UTF8)) +{ +Perl_warner(aTHX_ packWARN(WARN_UTF8), Wide character in print); +} +} + PUSHMARK(sp); XPUSHs(rp); XPUSHs(sv); Besides the problem with PerlIO_isutf8(), there are other problems that spring to my mind straight away with this: - is getting the PerlIO * for STDOUT to right thing to be doing anyway? - if items 2, do we need to handle the UTF8-ness of each of those items individually before we join them? - we need to code this in such a way as to remain backwards compatible with older Perls. looks like this is the main question. Do we handle utf8 only for perl 5.8? It's only Perl 5.8 that has the special UTF-8 flag which the functions above all operate with respect to. If a Perl variable contains a sequence of bytes that make up a valid UTF-8 character, but the string is not flagged with Perl's special flag, then Perl's built-in print() doesn't do this automatic conversion anyway. 
IOW,

    print "Content-type: text/plain\n\n";
    $a = "\xC3\xBC";
    print $a;

retrieved from a mod_cgi server produces (via od -b / od -c):

    0000000 303 274
    0000002

Perl 5.6 and older don't have the UTF-8 flag and hence don't do any automatic conversion via print(). Therefore, under those versions mod_perl's print() should not differ from Perl's print() in the way that it does under 5.8, so no change should be required there.

Sure enough, looking at the doio.c source file in Perl 5.6.1, the entire chunk of code that I half-inched above is not present.

Steve
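The distinction between "bytes that happen to be valid UTF-8" and "a string flagged as UTF-8" can be checked directly with the utf8:: utility functions. A minimal sketch, assuming Perl 5.8.1 or later; the variable names are purely illustrative:

```perl
use strict;
use warnings;

my $raw     = "\xC3\xBC";        # two octets; the UTF-8 flag is off
my $decoded = $raw;
utf8::decode($decoded);          # in place: now one character, U+00FC, flag on

printf "raw:     flagged=%d length=%d\n",
    (utf8::is_utf8($raw) ? 1 : 0), length($raw);          # flagged=0 length=2
printf "decoded: flagged=%d length=%d\n",
    (utf8::is_utf8($decoded) ? 1 : 0), length($decoded);  # flagged=1 length=1
```

Only the second string is a candidate for the automatic conversion discussed in this thread; the first is printed byte-for-byte by both print() implementations.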
Re: Undocumented behaviour in Apache->print()?
Steve Hay wrote: Stas Bekman wrote: I have attempted to shoe-horn this into mod_perl's print() method (in src/modules/perl/Apache.xs). Here's the diff against mod_perl 1.28: [Unfortunately, I've had to comment-out the first part of that if block, because I got an unresolved external symbol error relating to the PerlIO_isutf8() function otherwise (which may be because that function isn't documented in the perlapio manpage).] --- Apache.xs.orig2003-06-06 12:31:10.0 +0100 +++ Apache.xs2003-07-15 12:20:42.0 +0100 @@ -1119,12 +1119,25 @@ SV *sv = sv_newmortal(); SV *rp = ST(0); SV *sendh = perl_get_sv(Apache::__SendHeader, TRUE); +/*PerlIO *fp = PerlIO_stdout();*/ if(items 2) do_join(sv, sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */ else sv_setsv(sv, ST(1)); +/*if (PerlIO_isutf8(fp)) { +if (!SvUTF8(sv)) +sv_utf8_upgrade(sv = sv_mortalcopy(sv)); +} +else*/ if (DO_UTF8(sv)) { +if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE) + ckWARN_d(WARN_UTF8)) +{ +Perl_warner(aTHX_ packWARN(WARN_UTF8), Wide character in print); +} +} + PUSHMARK(sp); XPUSHs(rp); XPUSHs(sv); Besides the problem with PerlIO_isutf8(), there are other problems that spring to my mind straight away with this: - is getting the PerlIO * for STDOUT to right thing to be doing anyway? - if items 2, do we need to handle the UTF8-ness of each of those items individually before we join them? - we need to code this in such a way as to remain backwards compatible with older Perls. looks like this is the main question. Do we handle utf8 only for perl 5.8? It's only Perl 5.8 that has the special UTF-8 flag which the functions above all operate with respect to. If a Perl variable contains a sequence of bytes that make up a valid UTF-8 character, but the string is not flagged with Perl's special flag, then Perl's built-in print() doesn't do this automatic conversion anyway. Yes. Apps wanting to handle utf will need to 'require 5.008;' as in your example. 
IOW, print Content-type: text/plain\n\n; $a = \xC3\xBC; print $a; retrieved from a mod_cgi server produces (via od -b / od -c): 000 303 274 002 yup, because you need to add utf8::decode($a); before printing $a. Which your version does as well. Perl 5.6 and older don't have the UTF-8 flag and hence don't do any automatic conversion via print(). Therefore, mod_perl's print() should not have the difference from Perl's print() that exists in 5.8, so no change should be required. Sure enough, looking at the doio.c source file in Perl 5.6.1, the entire chunk of code that I half-inched above is not present. So you suggest that we copy this functionality from Perl. So if need to #ifdef it for 5.8.0. I have attempted to shoe-horn this into mod_perl's print() method (in src/modules/perl/Apache.xs). Here's the diff against mod_perl 1.28: [Unfortunately, I've had to comment-out the first part of that if block, because I got an unresolved external symbol error relating to the PerlIO_isutf8() function otherwise (which may be because that function isn't documented in the perlapio manpage).] --- Apache.xs.orig2003-06-06 12:31:10.0 +0100 +++ Apache.xs2003-07-15 12:20:42.0 +0100 @@ -1119,12 +1119,25 @@ SV *sv = sv_newmortal(); SV *rp = ST(0); SV *sendh = perl_get_sv(Apache::__SendHeader, TRUE); +/*PerlIO *fp = PerlIO_stdout();*/ if(items 2) do_join(sv, sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */ else sv_setsv(sv, ST(1)); +/*if (PerlIO_isutf8(fp)) { +if (!SvUTF8(sv)) +sv_utf8_upgrade(sv = sv_mortalcopy(sv)); +} +else*/ if (DO_UTF8(sv)) { +if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE) + ckWARN_d(WARN_UTF8)) +{ +Perl_warner(aTHX_ packWARN(WARN_UTF8), Wide character in print); +} +} + PUSHMARK(sp); XPUSHs(rp); XPUSHs(sv); Besides the problem with PerlIO_isutf8(), mod_perl 1.x doesn't use perlio, hence you have this problem. adding: #include perlio.h should resolve it I think. 
> there are other problems that spring to my mind straight away with this:
>
> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?

PerlIO *fp = IoOFP(GvIOp(defoutgv))

> - if items > 2, do we need to handle the UTF8-ness of each of those items individually before we join them?

I'm not sure. How does perl handle this?

__
Stas Bekman
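On the question of joining items with mixed UTF8-ness, the behaviour is easy to observe from Perl itself. The sketch below uses Perl's join builtin, which to my understanding goes through the same core do_join() that the XS code calls; treat that link as an assumption. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $plain = "\xFC";              # unflagged, one byte: ü in Latin-1
my $wide  = "\x{263A}";          # flagged, code point > 0xFF

my $joined = join '', $plain, $wide;

# The unflagged byte is upgraded as a Latin-1 character, not copied raw,
# so the result is a flagged two-character string.
printf "flagged=%d length=%d first=U+%04X\n",
    (utf8::is_utf8($joined) ? 1 : 0),
    length($joined),
    ord(substr($joined, 0, 1));  # flagged=1 length=2 first=U+00FC
```

If that holds for do_join() as well, applying the upgrade or downgrade once to the joined string should give the same octets as handling each item separately, except that one wide item forces the whole joined string onto the "Wide character" path.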
Undocumented behaviour in Apache->print()?
Hi,

I've just spent quite a while tracking down a problem with a web page generated by a mod_perl program in which 8-bit ISO-8859-1 characters were not being shown properly. The software runs via Apache::Registry, and works fine under mod_cgi.

It turns out that the problem is due to a difference in behaviour between Perl's built-in print() function in Perl 5.8.0+ and the Apache->print() method that mod_perl overrides it with.

I've consulted the documentation on the mod_perl website, and could find no mention of the difference. If my conclusions below are correct then this information may well be worth adding.

Under Perl 5.8.0, if a string stored in Perl's internal UTF-8 format is passed to print() then by default it will be converted to the machine's native 8-bit character set on output to STDOUT. In my case, this is exactly as if I had called binmode(STDOUT, ':encoding(iso-8859-1)') before the print(). (If any characters in the UTF-8 string are not representable in ISO-8859-1 then a "Wide character in print()" warning will be emitted, and the bytes that make up that UTF-8 character will be output.)

However, mod_perl's Apache->print() method does not perform this automatic conversion. It simply prints the bytes that make up each UTF-8 character (i.e. it outputs the UTF-8 string as UTF-8), exactly as if you had called binmode(STDOUT, ':utf8') before Apache->print(). (No "Wide character in print()" warnings are produced for characters with code points > 0xFF either.)

The test program below illustrates this difference:

    use 5.008;
    use strict;
    use warnings;
    use Encode;

    my $cset = 'ISO-8859-1';
    #my $cset = 'UTF-8';

    print "Content-type: text/html; charset=$cset\n\n";
    print <<EOT;
    <html>
    <head>
    <meta http-equiv="Content-type" content="text/html; charset=$cset">
    </head>
    <body>
    EOT

    # $str is stored in Perl's internal UTF-8 format.
    my $str = Encode::decode('iso-8859-1', 'Zurück');
    print "<p>$str</p>\n";

    print <<EOT;
    </body>
    </html>
    EOT

Running under mod_cgi (using Perl's built-in print() function) the UTF-8 encoded data in $str is converted to ISO-8859-1 on-the-fly by the print(), and the end-user will see the intended output when $cset is ISO-8859-1. (Changing $cset to UTF-8 causes the ü to be replaced with ? in my web browser, because the single byte that is output for ü is not valid UTF-8, which is what the output is labelled as.)

Running under mod_perl (with Perl's built-in print() function now overridden by the Apache->print() method) the UTF-8 encoded data in $str is NOT converted to ISO-8859-1 on-the-fly as it is printed, and the end-user will see the two bytes that make up the UTF-8 representation of ü when $cset is ISO-8859-1. Changing $cset to UTF-8 in this case fixes it, because the output stream in this case happens to be valid UTF-8 all the way through.

There are two solutions to this:

1. To use $cset = 'ISO-8859-1': Explicitly convert the UTF-8 data in $str to ISO-8859-1 yourself before sending it to print(), rather than relying on print() to do that for you. This is, in general, not possible (not all characters in the UTF-8 string may be representable in ISO-8859-1), but for HTML output we can arrange for Encode::encode to convert any non-representable characters to their HTML character references:

    $str = Encode::encode('iso-8859-1', $str, Encode::FB_HTMLCREF);

2. To use $cset = 'UTF-8': Output UTF-8 directly, ensuring that *all* outgoing data is UTF-8 by adding an appropriate layer on STDOUT:

    binmode STDOUT, ':utf8';

The second method here is generally to be preferred, but in the old software that I was experiencing problems with, I was not able to add the utf8 layer to STDOUT reliably (the data was being output from a multitude of print() statements scattered in various places), so I stuck with the first method.
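As a rough illustration of the first solution, the sketch below wraps the Encode::encode() call in a small helper and feeds it a string containing one character that ISO-8859-1 cannot represent. The helper name to_latin1_html() and the sample string are assumptions made up for this example; only Encode::encode() and Encode::FB_HTMLCREF come from the Encode module itself.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode character data as ISO-8859-1 octets, turning
# anything unrepresentable into a decimal HTML character reference.
sub to_latin1_html {
    my ($text) = @_;
    my $copy = $text;            # encode() with a CHECK value may touch its input
    return Encode::encode('iso-8859-1', $copy, Encode::FB_HTMLCREF());
}

# "Zurück" plus U+263A, which has no ISO-8859-1 equivalent.
my $str = Encode::decode('iso-8859-1', "Zur\xFCck") . " \x{263A}";
my $out = to_latin1_html($str);

print "$out\n";                  # octets: Zur<0xFC>ck &#9786;
```

Because the result is a plain byte string with the UTF-8 flag off, it should be sent unchanged whether print() is Perl's built-in or the mod_perl override.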
I believed that it should work without the explicit encoding to ISO-8859-1 because I was unaware that mod_perl's print() override removed Perl's implicit encoding behaviour. Actually, the explicit encoding above is better anyway because it also handles characters that can't be encoded to ISO-8859-1, but nevertheless I think the difference in mod_perl's print() is still worth mentioning in the documentation somewhere.

Cheers,
Steve