Re: Undocumented behaviour in Apache->print()?

2003-07-25 Thread Steve Hay
Steve Hay wrote:

> Stas Bekman wrote:
>
>>> I have attempted to shoe-horn this into mod_perl's print() method (in
>>> src/modules/perl/Apache.xs).  Here's the diff against mod_perl 1.28:
>>> [Unfortunately, I've had to comment-out the first part of that if
>>> block, because I got an "unresolved external symbol" error relating
>>> to the PerlIO_isutf8() function otherwise (which may be because that
>>> function isn't documented in the perlapio manpage).]
>>
>> mod_perl 1.x doesn't use perlio, hence you have this problem. adding:
>>
>> #include "perlio.h"
>>
>> should resolve it I think.
>
> No.  The error was "unresolved external symbol", which means that the
> compiler is happy (it evidently has pulled in perlio.h, or something
> else that declares PerlIO_isutf8() as "extern ..."), but that the
> linker couldn't find the definition of that function.
>
> (Check: If I change "PerlIO_isutf8" to "PerlIO_isutf" (deliberate
> typo) then I get a different error - "undefined; assuming extern
> returning int" - because now no declaration has been supplied.)
>
> Listing the symbols exported from perl58.lib shows that PerlIO_isutf8
> is *not* one of them.  So where's the definition supposed to come from?
>
> I'll ask about this on the perlxs mailing list, I think.

Having asked about this, it turns out that the problem was 
PerlIO_isutf8() not being exported from perl58.lib on Windows (and other 
platforms where the symbols to export have to be explicitly listed).

I sent a patch off to p5p which fixes this 
(http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-07/msg01096.html), 
and it has been integrated as #20203.

So presumably this will not be a problem in the upcoming perl-5.8.1, but 
how do we cope with the perl-5.8.0 case?

I've attached a new patch (against mod_perl-1.28) which (I believe) 
fixes the UTF-8 problem, but it won't build on Windows with perl-5.8.0 
without #20203.

Steve
--- Apache.xs.orig  2003-06-06 12:31:10.0 +0100
+++ Apache.xs   2003-07-18 08:47:59.0 +0100
@@ -1119,11 +1119,27 @@
     SV *sv = sv_newmortal();
     SV *rp = ST(0);
     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
+#if PERL_VERSION >= 8
+    PerlIO *fp = IoOFP(GvIOp(defoutgv));
+#endif
 
     if(items > 2)
         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
     else
         sv_setsv(sv, ST(1));
+#if PERL_VERSION >= 8
+    if (PerlIO_isutf8(fp)) {
+        if (!SvUTF8(sv))
+            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
+    }
+    else if (DO_UTF8(sv)) {
+        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
+            && ckWARN_d(WARN_UTF8))
+        {
+            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
+        }
+    }
+#endif
 
     PUSHMARK(sp);
     XPUSHs(rp);
@@ -1176,6 +1192,20 @@
         int sent = 0;
         SV *sv = SvROK(ST(i)) && (SvTYPE(SvRV(ST(i))) == SVt_PV) ?
                  (SV*)SvRV(ST(i)) : ST(i);
+#if PERL_VERSION >= 8
+        PerlIO *fp = IoOFP(GvIOp(defoutgv));
+        if (PerlIO_isutf8(fp)) {
+            if (!SvUTF8(sv))
+                sv_utf8_upgrade(sv = sv_mortalcopy(sv));
+        }
+        else if (DO_UTF8(sv)) {
+            if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
+                && ckWARN_d(WARN_UTF8))
+            {
+                Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
+            }
+        }
+#endif
         buffer = SvPV(sv, len);
 #ifdef APACHE_SSL
         while(len > 0) {


Re: Undocumented behaviour in Apache->print()?

2003-07-16 Thread Steve Hay
Stas Bekman wrote:

> Steve Hay wrote:
>
>> It's only Perl 5.8 that has the special UTF-8 flag which the
>> functions above all operate with respect to.  If a Perl variable
>> contains a sequence of bytes that make up a valid UTF-8 character,
>> but the string is not flagged with Perl's special flag, then Perl's
>> built-in print() doesn't do this automatic conversion anyway.
>
> Yes.
>
> Apps wanting to handle utf will need to 'require 5.008;' as in your
> example.
>
>> IOW,
>>
>>    print "Content-type: text/plain\n\n";
>>    $a = "\xC3\xBC";
>>    print $a;
>>
>> retrieved from a mod_cgi server produces (via od -b / od -c):
>>
>>    0000000 303 274
>>    0000002
>
> yup, because you need to add utf8::decode($a); before printing $a.
> Which your version does as well.

(Indeed.  I meant it as example of how Perl's (5.8's) print() doesn't do
the conversion on strings that are not *flagged* as UTF-8, even when
they make valid UTF-8.)

>> Perl 5.6 and older don't have the UTF-8 flag and hence don't do any
>> automatic conversion via print().  Therefore, mod_perl's print()
>> should not have the difference from Perl's print() that exists in
>> 5.8, so no change should be required.
>>
>> Sure enough, looking at the doio.c source file in Perl 5.6.1, the
>> entire chunk of code that I half-inched above is not present.
>
> So you suggest that we copy this functionality from Perl. So we need
> to #ifdef it for 5.8.0.

So I'll add

#if PERL_VERSION >= 8
...
#endif

around the code that I've added.



>> I have attempted to shoe-horn this into mod_perl's print() method (in
>> src/modules/perl/Apache.xs).  Here's the diff against mod_perl 1.28:
>> [Unfortunately, I've had to comment-out the first part of that if
>> block, because I got an "unresolved external symbol" error relating to
>> the PerlIO_isutf8() function otherwise (which may be because that
>> function isn't documented in the perlapio manpage).]
>>
>> --- Apache.xs.orig  2003-06-06 12:31:10.0 +0100
>> +++ Apache.xs       2003-07-15 12:20:42.0 +0100
>> @@ -1119,12 +1119,25 @@
>>      SV *sv = sv_newmortal();
>>      SV *rp = ST(0);
>>      SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
>> +    /*PerlIO *fp = PerlIO_stdout();*/
>>  
>>      if(items > 2)
>>          do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
>>      else
>>          sv_setsv(sv, ST(1));
>>  
>> +    /*if (PerlIO_isutf8(fp)) {
>> +        if (!SvUTF8(sv))
>> +            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
>> +    }
>> +    else*/ if (DO_UTF8(sv)) {
>> +        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
>> +            && ckWARN_d(WARN_UTF8))
>> +        {
>> +            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
>> +        }
>> +    }
>> +
>>      PUSHMARK(sp);
>>      XPUSHs(rp);
>>      XPUSHs(sv);
>>
>> Besides the problem with PerlIO_isutf8(),
>
> mod_perl 1.x doesn't use perlio, hence you have this problem. adding:
>
> #include "perlio.h"
>
> should resolve it I think.

No.  The error was "unresolved external symbol", which means that the
compiler is happy (it evidently has pulled in perlio.h, or something
else that declares PerlIO_isutf8() as "extern ..."), but that the linker
couldn't find the definition of that function.

(Check: If I change "PerlIO_isutf8" to "PerlIO_isutf" (deliberate typo)
then I get a different error - "undefined; assuming extern returning
int" - because now no declaration has been supplied.)

Listing the symbols exported from perl58.lib shows that PerlIO_isutf8 is
*not* one of them.  So where's the definition supposed to come from?

I'll ask about this on the perlxs mailing list, I think.



>> there are other problems that
>> spring to my mind straight away with this:
>> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
>
> PerlIO *fp = IoOFP(GvIOp(defoutgv))

Seems to work OK for me.  What's defoutgv?

>> - if items > 2, do we need to handle the UTF8-ness of each of those
>> items individually before we join them?
>
> I'm not sure, how perl handles this?

Struggling as best as I can to read pp_print() in Perl's pp_hot.c, it
looks like Perl calls do_print() (which contains the UTF-8 handling that
I've stolen) for each item in the list that is passed to it.

Considering this more, I think that it probably isn't an issue: if you 
have two variables in Perl, one of which is flagged UTF-8 and the other 
of which isn't, then when you concatenate them, the whole is upgraded 
to flagged UTF-8 anyway.
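
The concatenation behaviour described above can be checked from Perl itself. The following is only a minimal sketch, assuming Perl 5.8 or later with the core Encode module; the variable names are made up for illustration and nothing here is taken from the mod_perl source.

```perl
use 5.008;
use strict;
use warnings;
use Encode ();

# One string flagged as UTF-8, one plain byte string.
my $flagged = Encode::decode('iso-8859-1', "\xFC");   # U+00FC, UTF-8 flag on
my $plain   = "Zur";                                  # ASCII, UTF-8 flag off

# Joining them (Perl's join is what do_join() implements) yields a
# string that carries the UTF-8 flag as a whole.
my $joined = join '', $plain, $flagged, "ck";

printf "flagged: %s\n", Encode::is_utf8($flagged) ? 'yes' : 'no';
printf "plain:   %s\n", Encode::is_utf8($plain)   ? 'yes' : 'no';
printf "joined:  %s\n", Encode::is_utf8($joined)  ? 'yes' : 'no';
```

If the joined string reports the flag, then applying the UTF-8 handling once to the joined value should be enough for the first half of the method.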

However, it has occurred to me that I've missed out adding the UTF-8 
handling to half of mod_perl's print() method!

The method is split into two halves:

   if (!mod_perl_sent_header(r, 0)) {
   ...
   } else {
   ...
   }
and I've only handled the first half!

The first half joins all of the items together and then calls 
send_cgi_header().  That outputs everything down to the first blank line 
(i.e. all the headers), then sets the sent headers flag and recurses 
on $r->print().  Next time around, we'll enter the second half, which 
simply calls write_client().

If we've already been through the first half then the UTF-8 conversion 
will have been applied already, 

Re: Undocumented behaviour in Apache->print()?

2003-07-15 Thread Stas Bekman
Steve Hay wrote:

> Hi,
>
> I've just spent quite a while tracking down a problem with a web page
> generated by a mod_perl program in which 8-bit ISO-8859-1 characters
> were not being shown properly.  The software runs via Apache::Registry,
> and works fine under mod_cgi.
>
> It turns out that the problem is due to a difference in behaviour
> between Perl's built-in print() function in Perl 5.8.0+ and the
> Apache->print() method that mod_perl overrides it with.  I've consulted
> the documentation on the mod_perl website, and could find no mention of
> the difference.  If my conclusions below are correct then this
> information may well be worth adding.

[the rest of the very detailed analysis has been snipped]

5.8.0 is a pretty new perl version, which provides the new functionality, and 
it seems that hardly anybody has been using the UTF stuff with mod_perl. So I 
suppose you are the first one to hit the problem. Certainly we need to update 
mod_perl to handle this correctly. Would you be interested to try to make 
Apache->print() do the right thing? If not, we should log it in the STATUS 
file and hopefully someone will have the time and inclination to solve it.

In any case a simple test that reproduces the problem will be needed.

__
Stas Bekman            JAm_pH -- Just Another mod_perl Hacker
http://stason.org/ mod_perl Guide --- http://perl.apache.org
mailto:[EMAIL PROTECTED] http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org   http://ticketmaster.com


Re: Undocumented behaviour in Apache->print()?

2003-07-15 Thread Steve Hay
Hi Stas,

Stas Bekman wrote:

> Steve Hay wrote:
>
>> Hi,
>>
>> I've just spent quite a while tracking down a problem with a web page
>> generated by a mod_perl program in which 8-bit ISO-8859-1 characters
>> were not being shown properly.  The software runs via
>> Apache::Registry, and works fine under mod_cgi.
>>
>> It turns out that the problem is due to a difference in behaviour
>> between Perl's built-in print() function in Perl 5.8.0+ and the
>> Apache->print() method that mod_perl overrides it with.  I've
>> consulted the documentation on the mod_perl website, and could find
>> no mention of the difference.  If my conclusions below are correct
>> then this information may well be worth adding.
>
> [the rest of the very detailed analysis has been snipped]
>
> 5.8.0 is a pretty new perl version, which provides the new
> functionality, and it seems that hardly anybody has been using the UTF
> stuff with mod_perl.

5.8.0 is actually a couple of days short of being one year old (happy
birthday!), which is increasingly not that new any more.  5.8.1 should
be out soon too.

As for hardly anybody using UTF8 stuff with mod_perl... I didn't think
that I was until I realised that most XML parsers (certainly the two
that I use most -- XML::LibXML and XML::DOM) return all their data in
Perl's internal UTF-8 format!  Then the penny dropped that I was
actually using it rather a lot :-)

> So I suppose you are the first one to hit the problem. Certainly we
> need to update mod_perl to handle this correctly. Would you be
> interested to try to make Apache->print() do the right thing?

Hmm.  We really need somebody who understands the internals of Perl and
mod_perl better than me, but here's a first stab at it:

The Perl source code contains a pp_print() function in pp_hot.c which 
I presume is basically CORE::print().  It makes use of a do_print() 
function.  I think that function comes from doio.c, although it's 
actually called Perl_do_print() there.  That function does some stuff 
with the UTF-8 flag, which I guess is the sort of thing that we're 
after.  Here's a chunk of Perl_do_print() from Perl 5.8.0:

   if (PerlIO_isutf8(fp)) {
       if (!SvUTF8(sv))
           sv_utf8_upgrade(sv = sv_mortalcopy(sv));
   }
   else if (DO_UTF8(sv)) {
       if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
           && ckWARN_d(WARN_UTF8))
       {
           Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
       }
   }

I think what this does is look to see if the fp (a PerlIO *) has the
":utf8" layer.  If so, then it upgrades the sv to UTF8 (which is always
possible).  If not, then it looks to see whether the sv is flagged as
UTF8 and the "bytes" pragma is not in effect.  If so, then it downgrades
the sv from UTF8 (which is not always possible -- if that fails and the
UTF8 warnings category is enabled then it outputs the good ol' "Wide
character in print" warning).
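
For readers more at home in Perl than in C, the branch quoted above can be approximated at the Perl level. This is a rough sketch only, not the code Perl or mod_perl actually runs: bytes_for_output() is a hypothetical helper, its $handle_is_utf8 argument stands in for PerlIO_isutf8(fp), and the "bytes" pragma check that DO_UTF8() also performs is left out.

```perl
use 5.008;
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the bytes that the quoted branch would
# hand on to the output layer for a given string.
sub bytes_for_output {
    my ($string, $handle_is_utf8) = @_;
    my $copy = $string;                     # plays the role of sv_mortalcopy(sv)

    if ($handle_is_utf8) {
        utf8::upgrade($copy);               # like sv_utf8_upgrade(); always succeeds
        utf8::encode($copy);                # expose the UTF-8 bytes as a byte string
    }
    elsif (Encode::is_utf8($copy)) {        # approximates DO_UTF8(sv)
        if (!utf8::downgrade($copy, 1)) {   # 1 = fail quietly, like the TRUE argument
            warn "Wide character in print\n";
            utf8::encode($copy);            # the UTF-8 bytes go out unchanged
        }
    }
    return $copy;
}

# Show the result the way "od -b" would: octal, one triplet per byte.
my $u_umlaut = Encode::decode('iso-8859-1', "\xFC");
for my $is_utf8 (0, 1) {
    my $bytes = bytes_for_output($u_umlaut, $is_utf8);
    printf "utf8 layer %d: %s\n", $is_utf8,
        join ' ', map { sprintf('%03o', ord($_)) } split //, $bytes;
}
```

Without the layer the flagged string is downgraded to the single byte 374; with the layer the two bytes 303 274 are passed through.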

I have attempted to shoe-horn this into mod_perl's print() method (in
src/modules/perl/Apache.xs).  Here's the diff against mod_perl 1.28:
[Unfortunately, I've had to comment-out the first part of that if
block, because I got an "unresolved external symbol" error relating to
the PerlIO_isutf8() function otherwise (which may be because that
function isn't documented in the perlapio manpage).]

--- Apache.xs.orig  2003-06-06 12:31:10.0 +0100
+++ Apache.xs       2003-07-15 12:20:42.0 +0100
@@ -1119,12 +1119,25 @@
     SV *sv = sv_newmortal();
     SV *rp = ST(0);
     SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
+    /*PerlIO *fp = PerlIO_stdout();*/
 
     if(items > 2)
         do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
     else
         sv_setsv(sv, ST(1));
 
+    /*if (PerlIO_isutf8(fp)) {
+        if (!SvUTF8(sv))
+            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
+    }
+    else*/ if (DO_UTF8(sv)) {
+        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
+            && ckWARN_d(WARN_UTF8))
+        {
+            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
+        }
+    }
+
     PUSHMARK(sp);
     XPUSHs(rp);
     XPUSHs(sv);

Besides the problem with PerlIO_isutf8(), there are other problems that
spring to my mind straight away with this:
- is getting the PerlIO * for STDOUT the right thing to be doing anyway?
- if items > 2, do we need to handle the UTF8-ness of each of those
items individually before we join them?
- we need to code this in such a way as to remain backwards compatible
with older Perls.

Anyway, it's a start.

> If not, we should log it in the STATUS file and hopefully someone will
> have the time and inclination to solve it.

Hopefully the above stab at it will encourage somebody to have a serious
look.

> In any case a simple test that reproduces the problem will be needed.

This test program reproduces the problem:

   use 5.008;
   use Encode;
   print "Content-type: text/plain\n\n", decode('iso-8859-1', 'ü');

Use LWP's get program to get that from an Apache/mod_cgi setup, run it 
through UNIX's od -c (get http://localhost/cgi-bin/test.pl | od 

Re: Undocumented behaviour in Apache->print()?

2003-07-15 Thread Stas Bekman
Steve Hay wrote:

>> 5.8.0 is a pretty new perl version, which provides the new
>> functionality, and it seems that hardly anybody has been using the UTF
>> stuff with mod_perl.
>
> 5.8.0 is actually a couple of days short of being one year old (happy
> birthday!), which is increasingly not that new any more.  5.8.1 should
> be out soon too.

I meant that it was too new to be embraced by the crowd. It'll probably take a
few more years before this will happen. In any case, this is just an excuse ;)

> As for hardly anybody using UTF8 stuff with mod_perl... I didn't think
> that I was until I realised that most XML parsers (certainly the two
> that I use most -- XML::LibXML and XML::DOM) return all their data in
> Perl's internal UTF-8 format!  Then the penny dropped that I was
> actually using it rather a lot :-)

I thought XML was dead. Do people still use this archaic technology? I went to 
this session at this OS conference with many k00l ppls and there was this 
dude[1] who said that YAML is the future. Next they started talking about 
animals, and for some reason everybody liked ponie. All well, orange people 
[2], orange sites [3], orange ponies [4], jetlag, too many flights, too little 
sleep...

1: 
http://husk.org/pics/imgs/people/perl/london.pm_ingy_2001-07-30/ingy_nino_tired.jpg
2: 
http://husk.org/pics/imgs/people/perl/london.pm_ingy_2001-07-30/acme_perl_hacker_scary.jpg
3: http://search.cpan.org/
4: http://ponie.kwiki.org/ http://www.poniecode.org/

;)



Re: Undocumented behaviour in Apache->print()?

2003-07-15 Thread Stas Bekman
[putting the test case on the top]

Steve Hay wrote:

>> In any case a simple test that reproduces the problem will be needed.
>
> This test program reproduces the problem:
>
>    use 5.008;
>    use Encode;
>    print "Content-type: text/plain\n\n", decode('iso-8859-1', 'ü');
>
> Use LWP's get program to get that from an Apache/mod_cgi setup, run it
> through UNIX's od -c (get http://localhost/cgi-bin/test.pl | od -c)
> and you get:
>
>    0000000 374
>    0000001
>
> Try the same from an Apache/mod_perl setup and you get:
>
>    0000000 303 274
>    0000002
>
> i.e. the double-byte UTF-8 character representing ü that has been output
> is converted back to ü by Perl's print() [ü is character 252, octal
> 374], but is left as the two bytes by Apache's print().
>
> I've actually re-built my mod_perl using the half-formed patch given
> above and it fixes this particular test case!

On my linux box it's 'od -b', 'od -c' prints the actual ascii chars.

I've tested mp2 and it has the same problem. I've used a different version of
your test:

#!/usr/bin/perl -w
use utf8;
print "Content-type: text/plain\n\n";
$a = "\xC3\xBC";
utf8::decode($a); print $a;

which gives the same char, as in:

% perl -le '$a = "\xC3\xBC"; use utf8; utf8::decode($a); print $a;'
ü

mod_perl 1.0 and 2.0 respond with:

GET 'http://localhost:8002/cgi-bin/test.pl' | od -b
0000000 303 274

and mod_cgi with

0000000 374

> Hmm.  We really need somebody who understands the internals of Perl and
> mod_perl better than me, but here's a first stab at it:
>
> The Perl source code contains a pp_print() function in pp_hot.c which
> I presume is basically CORE::print().  It makes use of a do_print()
> function.  I think that function comes from doio.c, although it's
> actually called Perl_do_print() there.  That function does some stuff
> with the UTF-8 flag, which I guess is the sort of thing that we're
> after.  Here's a chunk of Perl_do_print() from Perl 5.8.0:
>
>    if (PerlIO_isutf8(fp)) {
>        if (!SvUTF8(sv))
>            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
>    }
>    else if (DO_UTF8(sv)) {
>        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
>            && ckWARN_d(WARN_UTF8))
>        {
>            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
>        }
>    }
>
> I think what this does is look to see if the fp (a PerlIO *) has the
> ":utf8" layer.  If so, then it upgrades the sv to UTF8 (which is always
> possible).  If not, then it looks to see whether the sv is flagged as
> UTF8 and the "bytes" pragma is not in effect.  If so, then it downgrades
> the sv from UTF8 (which is not always possible -- if that fails and the
> UTF8 warnings category is enabled then it outputs the good ol' "Wide
> character in print" warning).
>
> I have attempted to shoe-horn this into mod_perl's print() method (in
> src/modules/perl/Apache.xs).  Here's the diff against mod_perl 1.28:
> [Unfortunately, I've had to comment-out the first part of that if
> block, because I got an "unresolved external symbol" error relating to
> the PerlIO_isutf8() function otherwise (which may be because that
> function isn't documented in the perlapio manpage).]
>
> --- Apache.xs.orig  2003-06-06 12:31:10.0 +0100
> +++ Apache.xs       2003-07-15 12:20:42.0 +0100
> @@ -1119,12 +1119,25 @@
>      SV *sv = sv_newmortal();
>      SV *rp = ST(0);
>      SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
> +    /*PerlIO *fp = PerlIO_stdout();*/
>  
>      if(items > 2)
>          do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
>      else
>          sv_setsv(sv, ST(1));
>  
> +    /*if (PerlIO_isutf8(fp)) {
> +        if (!SvUTF8(sv))
> +            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
> +    }
> +    else*/ if (DO_UTF8(sv)) {
> +        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
> +            && ckWARN_d(WARN_UTF8))
> +        {
> +            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
> +        }
> +    }
> +
>      PUSHMARK(sp);
>      XPUSHs(rp);
>      XPUSHs(sv);
>
> Besides the problem with PerlIO_isutf8(), there are other problems that
> spring to my mind straight away with this:
> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
> - if items > 2, do we need to handle the UTF8-ness of each of those
> items individually before we join them?
> - we need to code this in such a way as to remain backwards compatible
> with older Perls.

looks like this is the main question. Do we handle utf8 only for perl 5.8?



Re: Undocumented behaviour in Apache->print()?

2003-07-15 Thread Steve Hay
Stas Bekman wrote:

>> I have attempted to shoe-horn this into mod_perl's print() method (in
>> src/modules/perl/Apache.xs).  Here's the diff against mod_perl
>> 1.28:  [Unfortunately, I've had to comment-out the first part of that
>> if block, because I got an "unresolved external symbol" error
>> relating to the PerlIO_isutf8() function otherwise (which may be
>> because that function isn't documented in the perlapio manpage).]
>>
>> --- Apache.xs.orig  2003-06-06 12:31:10.0 +0100
>> +++ Apache.xs       2003-07-15 12:20:42.0 +0100
>> @@ -1119,12 +1119,25 @@
>>      SV *sv = sv_newmortal();
>>      SV *rp = ST(0);
>>      SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
>> +    /*PerlIO *fp = PerlIO_stdout();*/
>>  
>>      if(items > 2)
>>          do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
>>      else
>>          sv_setsv(sv, ST(1));
>>  
>> +    /*if (PerlIO_isutf8(fp)) {
>> +        if (!SvUTF8(sv))
>> +            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
>> +    }
>> +    else*/ if (DO_UTF8(sv)) {
>> +        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
>> +            && ckWARN_d(WARN_UTF8))
>> +        {
>> +            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
>> +        }
>> +    }
>> +
>>      PUSHMARK(sp);
>>      XPUSHs(rp);
>>      XPUSHs(sv);
>>
>> Besides the problem with PerlIO_isutf8(), there are other problems
>> that spring to my mind straight away with this:
>> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
>> - if items > 2, do we need to handle the UTF8-ness of each of those
>> items individually before we join them?
>> - we need to code this in such a way as to remain backwards
>> compatible with older Perls.
>
> looks like this is the main question. Do we handle utf8 only for perl
> 5.8?

It's only Perl 5.8 that has the special UTF-8 flag which the functions 
above all operate with respect to.  If a Perl variable contains a 
sequence of bytes that make up a valid UTF-8 character, but the string 
is not flagged with Perl's special flag, then Perl's built-in print() 
doesn't do this automatic conversion anyway.

IOW,

   print "Content-type: text/plain\n\n";
   $a = "\xC3\xBC";
   print $a;

retrieved from a mod_cgi server produces (via od -b / od -c):

   0000000 303 274
   0000002

Perl 5.6 and older don't have the UTF-8 flag and hence don't do any 
automatic conversion via print().  Therefore, mod_perl's print() should 
not have the difference from Perl's print() that exists in 5.8, so no 
change should be required.

Sure enough, looking at the doio.c source file in Perl 5.6.1, the 
entire chunk of code that I half-inched above is not present.

Steve



Re: Undocumented behaviour in Apache->print()?

2003-07-15 Thread Stas Bekman
Steve Hay wrote:

> Stas Bekman wrote:
>
>>> I have attempted to shoe-horn this into mod_perl's print() method (in
>>> src/modules/perl/Apache.xs).  Here's the diff against mod_perl
>>> 1.28:  [Unfortunately, I've had to comment-out the first part of that
>>> if block, because I got an "unresolved external symbol" error
>>> relating to the PerlIO_isutf8() function otherwise (which may be
>>> because that function isn't documented in the perlapio manpage).]
>>>
>>> --- Apache.xs.orig  2003-06-06 12:31:10.0 +0100
>>> +++ Apache.xs       2003-07-15 12:20:42.0 +0100
>>> @@ -1119,12 +1119,25 @@
>>>      SV *sv = sv_newmortal();
>>>      SV *rp = ST(0);
>>>      SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
>>> +    /*PerlIO *fp = PerlIO_stdout();*/
>>>  
>>>      if(items > 2)
>>>          do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
>>>      else
>>>          sv_setsv(sv, ST(1));
>>>  
>>> +    /*if (PerlIO_isutf8(fp)) {
>>> +        if (!SvUTF8(sv))
>>> +            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
>>> +    }
>>> +    else*/ if (DO_UTF8(sv)) {
>>> +        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
>>> +            && ckWARN_d(WARN_UTF8))
>>> +        {
>>> +            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
>>> +        }
>>> +    }
>>> +
>>>      PUSHMARK(sp);
>>>      XPUSHs(rp);
>>>      XPUSHs(sv);
>>>
>>> Besides the problem with PerlIO_isutf8(), there are other problems
>>> that spring to my mind straight away with this:
>>> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?
>>> - if items > 2, do we need to handle the UTF8-ness of each of those
>>> items individually before we join them?
>>> - we need to code this in such a way as to remain backwards
>>> compatible with older Perls.
>>
>> looks like this is the main question. Do we handle utf8 only for perl
>> 5.8?
>
> It's only Perl 5.8 that has the special UTF-8 flag which the functions
> above all operate with respect to.  If a Perl variable contains a
> sequence of bytes that make up a valid UTF-8 character, but the string
> is not flagged with Perl's special flag, then Perl's built-in print()
> doesn't do this automatic conversion anyway.

Yes.

Apps wanting to handle utf will need to 'require 5.008;' as in your example.

> IOW,
>
>    print "Content-type: text/plain\n\n";
>    $a = "\xC3\xBC";
>    print $a;
>
> retrieved from a mod_cgi server produces (via od -b / od -c):
>
>    0000000 303 274
>    0000002

yup, because you need to add utf8::decode($a); before printing $a. Which your
version does as well.

> Perl 5.6 and older don't have the UTF-8 flag and hence don't do any
> automatic conversion via print().  Therefore, mod_perl's print() should
> not have the difference from Perl's print() that exists in 5.8, so no
> change should be required.
>
> Sure enough, looking at the doio.c source file in Perl 5.6.1, the
> entire chunk of code that I half-inched above is not present.

So you suggest that we copy this functionality from Perl. So we need to #ifdef
it for 5.8.0.

> I have attempted to shoe-horn this into mod_perl's print() method (in
> src/modules/perl/Apache.xs).  Here's the diff against mod_perl 1.28:
> [Unfortunately, I've had to comment-out the first part of that if
> block, because I got an "unresolved external symbol" error relating to
> the PerlIO_isutf8() function otherwise (which may be because that
> function isn't documented in the perlapio manpage).]
>
> --- Apache.xs.orig  2003-06-06 12:31:10.0 +0100
> +++ Apache.xs       2003-07-15 12:20:42.0 +0100
> @@ -1119,12 +1119,25 @@
>      SV *sv = sv_newmortal();
>      SV *rp = ST(0);
>      SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
> +    /*PerlIO *fp = PerlIO_stdout();*/
>  
>      if(items > 2)
>          do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
>      else
>          sv_setsv(sv, ST(1));
>  
> +    /*if (PerlIO_isutf8(fp)) {
> +        if (!SvUTF8(sv))
> +            sv_utf8_upgrade(sv = sv_mortalcopy(sv));
> +    }
> +    else*/ if (DO_UTF8(sv)) {
> +        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
> +            && ckWARN_d(WARN_UTF8))
> +        {
> +            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
> +        }
> +    }
> +
>      PUSHMARK(sp);
>      XPUSHs(rp);
>      XPUSHs(sv);
>
> Besides the problem with PerlIO_isutf8(),

mod_perl 1.x doesn't use perlio, hence you have this problem. adding:

#include "perlio.h"

should resolve it I think.

> there are other problems that
> spring to my mind straight away with this:
> - is getting the PerlIO * for STDOUT the right thing to be doing anyway?

PerlIO *fp = IoOFP(GvIOp(defoutgv))

> - if items > 2, do we need to handle the UTF8-ness of each of those
> items individually before we join them?

I'm not sure, how perl handles this?



Undocumented behaviour in Apache->print()?

2003-07-11 Thread Steve Hay
Hi,

I've just spent quite a while tracking down a problem with a web page 
generated by a mod_perl program in which 8-bit ISO-8859-1 characters 
were not being shown properly.  The software runs via Apache::Registry, 
and works fine under mod_cgi.

It turns out that the problem is due to a difference in behaviour
between Perl's built-in print() function in Perl 5.8.0+ and the
Apache->print() method that mod_perl overrides it with.  I've consulted
the documentation on the mod_perl website, and could find no mention of
the difference.  If my conclusions below are correct then this
information may well be worth adding.

Under Perl 5.8.0, if a string stored in Perl's internal UTF-8 format is
passed to print() then by default it will be converted to the machine's
native 8-bit character set on output to STDOUT.  In my case, this is
exactly as if I had called binmode(STDOUT, ':encoding(iso-8859-1)')
before the print().  (If any characters in the UTF-8 string are not
representable in ISO-8859-1 then a "Wide character in print()" warning
will be emitted, and the bytes that make up that UTF-8 character will be
output.)

However, mod_perl's Apache->print() method does not perform this
automatic conversion.  It simply prints the bytes that make up each
UTF-8 character (i.e. it outputs the UTF-8 string as UTF-8), exactly as
if you had called binmode(STDOUT, ':utf8') before Apache->print().  (No
"Wide character in print()" warnings are produced for characters with
code points > 0xFF either.)
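
The two byte sequences involved can be reproduced without a web server. The sketch below is illustrative only and assumes Perl 5.8 or later with Encode; octal_bytes() is a hypothetical helper that mimics the layout of "od -b", and the two conversions merely model what each print() puts on the wire.

```perl
use 5.008;
use strict;
use warnings;
use Encode ();

# Render a byte string roughly the way "od -b" does: one octal triplet per byte.
sub octal_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%03o', ord($_)) } split //, $bytes;
}

# U+00FC (u-umlaut) as a character string in Perl's internal UTF-8 format.
my $str = Encode::decode('iso-8859-1', "\xFC");

# Models the built-in print() on a handle with no layer: the string is
# downgraded to the single-byte representation before it is written.
my $native = $str;
utf8::downgrade($native);

# Models a print() that ignores the UTF-8 flag: the internal bytes go out as-is.
my $raw = $str;
utf8::encode($raw);

print "built-in print(): ", octal_bytes($native), "\n";   # 374
print "flag ignored:     ", octal_bytes($raw),    "\n";   # 303 274
```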

The test program below illustrates this difference:

   use 5.008;
   use strict;
   use warnings;
   use Encode;
   my $cset = 'ISO-8859-1';
   #my $cset = 'UTF-8';
   print "Content-type: text/html; charset=$cset\n\n";
   print <<EOT;
   <html>
   <head>
   <meta http-equiv="Content-type" content="text/html; charset=$cset">
   </head>
   <body>
   EOT
   # $str is stored in Perl's internal UTF-8 format.
   my $str = Encode::decode('iso-8859-1', 'Zurück');
   print "<p>$str</p>\n";
   print <<EOT;
   </body>
   </html>
   EOT

Running under mod_cgi (using Perl's built-in print() function) the UTF-8 
encoded data in $str is converted to ISO-8859-1 on-the-fly by the 
print(), and the end-user will see the intended output when $cset is 
ISO-8859-1.  (Changing $cset to UTF-8 causes the ü to be replaced with ? 
in my web browser because the ü which is output is not a valid UTF-8 
character (which the output is labelled as).)

Running under mod_perl (with Perl's built-in print() function now 
overridden by the Apache->print() method) the UTF-8 encoded data in $str 
is NOT converted to ISO-8859-1 on-the-fly as it is printed, and the 
end-user will see the two bytes that make up the UTF-8 representation of 
ü when $cset is ISO-8859-1.  Changing $cset to UTF-8 in this case 
fixes it, because the output stream in this case happens to be valid 
UTF-8 all the way through.

There are two solutions to this:

1. To use $cset = 'ISO-8859-1': Explicitly convert the UTF-8 data in 
$str to ISO-8859-1 yourself before sending it to print(), rather than 
relying on print() to do that for you.  This is, in general, not 
possible (not all characters in the UTF-8 string may be representable in 
ISO-8859-1), but for HTML output we can arrange for Encode::encode to 
convert any non-representable characters to their HTML character references:

   $str = Encode::encode('iso-8859-1', $str, Encode::FB_HTMLCREF);

2. To use $cset = 'UTF-8': Output UTF-8 directly, ensuring that *all* 
outgoing data is UTF-8 by adding an appropriate layer on STDOUT:

   binmode STDOUT, ':utf8';

The second method here is generally to be preferred, but in the old 
software that I was experiencing problems with, I was not able to add 
the utf8 layer to STDOUT reliably (the data was being output from a 
multitude of print() statements scattered in various places), so I stuck 
with the first method.  I believed that it should work without the 
explicit encoding to ISO-8859-1 because I was unaware that mod_perl's 
print() override removed Perl's implicit encoding behaviour.  Actually, 
the explicit encoding above is better anyway because it also handles 
characters that can't be encoded to ISO-8859-1, but nevertheless I think 
the difference in mod_perl's print() is still worth mentioning in the 
documentation somewhere.
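
To make the two options concrete, here is a small sketch of the bytes each one would send for a string that contains a character outside ISO-8859-1. It assumes Perl 5.8 or later with Encode; the sample string and variable names are invented for illustration, and the second option is shown as an explicit encode rather than as a layer on STDOUT.

```perl
use 5.008;
use strict;
use warnings;
use Encode ();

# U+00FC fits in ISO-8859-1; U+263A (a smiley) does not.
my $str = "Zur\x{FC}ck \x{263A}";

# Option 1: encode explicitly, turning unmappable characters into HTML
# numeric character references.  A copy is passed in because encode()
# can modify its source argument, depending on the CHECK flags.
my $copy   = $str;
my $latin1 = Encode::encode('iso-8859-1', $copy, Encode::FB_HTMLCREF);

# Option 2: send UTF-8 for everything (the bytes a ':utf8' layer would write).
my $utf8 = Encode::encode('UTF-8', $str);

# Print both results with non-ASCII bytes shown as \xNN escapes.
for my $pair ([latin1 => $latin1], [utf8 => $utf8]) {
    my ($label, $bytes) = @$pair;
    (my $shown = $bytes) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord($1))/ge;
    print "$label: $shown\n";
}
```

The first result is only safe inside HTML, where the character reference will be interpreted by the browser; the second needs the page to be labelled as UTF-8.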

Cheers,

Steve