[putting the test case on the top]

Steve Hay wrote:

>> In any case a simple test that reproduces the problem will be needed.
>
>
> This test program reproduces the problem:
>
>    use 5.008;
>    use Encode;
>    print "Content-type: text/plain\n\n", decode('iso-8859-1', 'ü');
>
> Use LWP's "get" program to get that from an Apache/mod_cgi setup, run it
> through UNIX's "od -c" (get http://localhost/cgi-bin/test.pl | od -c)
> and you get:
>
>    0000000 374
>    0000001
>
> Try the same from an Apache/mod_perl setup and you get:
>
>    0000000 303 274
>    0000002
>
> i.e. the double-byte UTF-8 character representing ü that has been output
> is converted back to ü by Perl's print() [ü is character 252, octal
> 374], but is left as the two bytes by Apache's print().
>
> I've actually re-built my mod_perl using the half-formed patch given
> above and it fixes this particular test case!

On my Linux box it's 'od -b'; 'od -c' prints the actual ASCII chars.

I've tested mp2 and it has the same problem. I've used a different version of your test:

#!/usr/bin/perl -w
use utf8;
print "Content-type: text/plain\n\n";
$a = "\xC3\xBC";
utf8::decode($a); print $a;

which gives the same char, as in:
% perl -le '$a = "\xC3\xBC"; use utf8; utf8::decode($a); print $a;'
ü

mod_perl 1.0 and 2.0 respond with:

GET 'http://localhost:8002/cgi-bin/test.pl' | od -b
0000000 303 274

and mod_cgi with:
0000000 374


Hmm. We really need somebody who understands the internals of Perl and mod_perl better than me, but here's a first stab at it:

The Perl source code contains a pp_print() function in "pp_hot.c" which I presume is basically CORE::print(). It makes use of a do_print() function. I think that function comes from "doio.c", although it's actually called Perl_do_print() there. That function does some stuff with the UTF-8 flag, which I guess is the sort of thing that we're after. Here's a chunk of Perl_do_print() from Perl 5.8.0:

   if (PerlIO_isutf8(fp)) {
       if (!SvUTF8(sv))
           sv_utf8_upgrade(sv = sv_mortalcopy(sv));
   }
   else if (DO_UTF8(sv)) {
       if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
           && ckWARN_d(WARN_UTF8))
       {
           Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
       }
   }

I think what this does is check whether the fp (a PerlIO *) has the ":utf8" layer. If so, it upgrades the sv to UTF-8 unless it is already flagged as UTF-8 (upgrading is always possible). If not, it checks whether the sv is flagged as UTF-8 and the "bytes" pragma is not in effect (that is what DO_UTF8() tests). If so, it downgrades a copy of the sv from UTF-8, which is not always possible: if that fails and the UTF8 warnings category is enabled then it outputs the good ol' "Wide character in print" warning.
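To make the downgrade step concrete, here is a small standalone C sketch of what "downgrading" means at the byte level. It is only a model: utf8_downgrade_sketch() is a hypothetical helper written for this message, not a Perl API function, and it does not reproduce how sv_utf8_downgrade() is actually implemented. It just shows why the two bytes 303 274 can become the single byte 374, and why a character above 255 cannot be downgraded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl API: tries to convert a
 * UTF-8 byte sequence to one byte per character (ISO-8859-1).
 * Returns the number of bytes written to out, or -1 if the input is
 * malformed, out is too small, or a code point above 0xFF is found
 * (the case in which Perl warns "Wide character in print"). */
long utf8_downgrade_sketch(const unsigned char *in, size_t len,
                           unsigned char *out, size_t outcap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);

    while (i < len) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII: the same single byte in both encodings */
            if (o >= outcap)
                return -1;
            out[o++] = c;
            i += 1;
        }
        else if ((c == 0xC2 || c == 0xC3) && i + 1 < len
                 && (in[i + 1] & 0xC0) == 0x80) {
            /* two-byte sequence encoding U+0080..U+00FF */
            if (o >= outcap)
                return -1;
            out[o++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i += 2;
        }
        else {
            /* lead byte for a code point above 0xFF, or malformed input */
            return -1;
        }
    }
    return (long)o;
}
```

Fed the bytes 0xC3 0xBC (octal 303 274) it yields the single byte 0xFC (octal 374), which matches what mod_cgi sends. Fed the three UTF-8 bytes of a character such as U+20AC it fails, which corresponds to the warning branch above.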

I have attempted to shoe-horn this into mod_perl's print() method (in "src/modules/perl/Apache.xs"). Unfortunately, I've had to comment out the first part of that "if" block, because otherwise I got an unresolved external symbol error for the PerlIO_isutf8() function (which may be because that function isn't documented in the perlapio manpage). Here's the diff against mod_perl 1.28:

--- Apache.xs.orig    2003-06-06 12:31:10.000000000 +0100
+++ Apache.xs    2003-07-15 12:20:42.000000000 +0100
@@ -1119,12 +1119,25 @@
    SV *sv = sv_newmortal();
    SV *rp = ST(0);
    SV *sendh = perl_get_sv("Apache::__SendHeader", TRUE);
+    /*PerlIO *fp = PerlIO_stdout();*/

    if(items > 2)
        do_join(sv, &sv_no, MARK+1, SP); /* $sv = join '', @_[1..$#_] */
    else
        sv_setsv(sv, ST(1));

+    /*if (PerlIO_isutf8(fp)) {
+        if (!SvUTF8(sv))
+        sv_utf8_upgrade(sv = sv_mortalcopy(sv));
+    }
+    else*/ if (DO_UTF8(sv)) {
+        if (!sv_utf8_downgrade((sv = sv_mortalcopy(sv)), TRUE)
+        && ckWARN_d(WARN_UTF8))
+        {
+        Perl_warner(aTHX_ packWARN(WARN_UTF8), "Wide character in print");
+        }
+    }
+
    PUSHMARK(sp);
    XPUSHs(rp);
    XPUSHs(sv);

Besides the problem with PerlIO_isutf8(), there are other problems that spring to my mind straight away with this:
- is getting the PerlIO * for STDOUT the right thing to be doing anyway?
- if "items > 2", do we need to handle the UTF8-ness of each of those items individually before we join them?
- we need to code this in such a way as to remain backwards compatible with older Perls.

It looks like that last point is the main question: do we handle UTF-8 only for Perl 5.8?
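On the "items > 2" question in the list above, the concern is what happens when some arguments are flagged as UTF-8 and others are plain bytes. The following standalone C sketch models one plausible rule: if any piece is UTF-8, the high bytes of the other pieces are upgraded while joining. Everything in it (struct piece, join_pieces_sketch()) is hypothetical and written only for illustration; it is not Perl's SV or do_join(), and whether do_join() already behaves this way in every supported Perl is not confirmed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a string carrying a UTF-8 flag; not Perl's SV. */
struct piece {
    const unsigned char *bytes;
    size_t len;
    int is_utf8;   /* non-zero if bytes holds UTF-8 */
};

/* Joins n pieces into out. If any piece is flagged UTF-8, the result
 * is UTF-8 and unflagged pieces are treated as ISO-8859-1 and upgraded
 * byte by byte. Sets *out_is_utf8 and returns the result length, or
 * (size_t)-1 if out is too small. */
size_t join_pieces_sketch(const struct piece *items, size_t n,
                          unsigned char *out, size_t cap, int *out_is_utf8)
{
    size_t i, j, o = 0;
    int any_utf8 = 0;

    assert(items != NULL && out != NULL && out_is_utf8 != NULL);

    /* first pass: decide the encoding of the joined result */
    for (i = 0; i < n; i++)
        if (items[i].is_utf8)
            any_utf8 = 1;

    /* second pass: copy, upgrading high bytes of unflagged pieces */
    for (i = 0; i < n; i++) {
        for (j = 0; j < items[i].len; j++) {
            unsigned char c = items[i].bytes[j];

            if (any_utf8 && !items[i].is_utf8 && c >= 0x80) {
                /* one ISO-8859-1 byte becomes a two-byte UTF-8 sequence */
                if (o + 2 > cap)
                    return (size_t)-1;
                out[o++] = (unsigned char)(0xC0 | (c >> 6));
                out[o++] = (unsigned char)(0x80 | (c & 0x3F));
            }
            else {
                if (o + 1 > cap)
                    return (size_t)-1;
                out[o++] = c;
            }
        }
    }
    *out_is_utf8 = any_utf8;
    return o;
}
```

With this rule the joined string has a single, consistent encoding, so the upgrade or downgrade decision can be made once on the result rather than per item.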


__________________________________________________________________
Stas Bekman            JAm_pH ------> Just Another mod_perl Hacker
http://stason.org/     mod_perl Guide ---> http://perl.apache.org
mailto:[EMAIL PROTECTED] http://use.perl.org http://apacheweek.com
http://modperlbook.org http://apache.org   http://ticketmaster.com


