I've just spent quite a while tracking down a problem with a web page generated by a mod_perl program in which 8-bit ISO-8859-1 characters were not being shown properly. The software runs via Apache::Registry, and works fine under mod_cgi.
It turns out that the problem is due to a difference in behaviour between Perl's built-in print() function in Perl 5.8.0+ and the Apache->print() method that mod_perl overrides it with. I've consulted the documentation on the mod_perl website, and could find no mention of the difference. If my conclusions below are correct then this information may well be worth adding.
Under Perl 5.8.0, if a string stored in Perl's internal UTF-8 format is passed to print() then by default it will be converted to the machine's native 8-bit character set on output to STDOUT. In my case, this is exactly as if I had called binmode(STDOUT, ':encoding(iso-8859-1)') before the print(). (If any characters in the UTF-8 string are not representable in ISO-8859-1 then a "Wide character in print()" warning will be emitted, and the bytes that make up that UTF-8 character will be output.)
However, mod_perl's Apache->print() method does not perform this automatic conversion. It simply prints the bytes that make up each UTF-8 character (i.e. it outputs the UTF-8 string as UTF-8), exactly as if you had called binmode(STDOUT, ':utf8') before Apache->print(). (No "Wide character in print()" warnings are produced for charcaters with code points > 0xFF either.)
The test program below illustrates this difference:
use 5.008; use strict; use warnings; use Encode;
my $cset = 'ISO-8859-1'; #my $cset = 'UTF-8';
print "Content-type: text/html; charset=$cset\n\n"; print <<EOT; <html> <head> <meta http-equiv="Content-type" content="text/html; charset=$cset"> </head> <body> EOT
# $str is stored in Perl's internal UTF-8 format. my $str = Encode::decode('iso-8859-1', 'Zurück'); print "<p>$str</p>\n";
print <<EOT; </body> </html> EOT
Running under mod_cgi (using Perl's built-in print() function) the UTF-8 encoded data in $str is converted to ISO-8859-1 on-the-fly by the print(), and the end-user will see the intended output when $cset is ISO-8859-1. (Changing $cset to UTF-8 causes the ü to be replaced with ? in my web browser because the ü which is output is not a valid UTF-8 character (which the output is labelled as).)
Running under mod_perl (with Perl's built-in print() function now overridden by the Apache->print() method) the UTF-8 encoded data in $str is NOT converted to ISO-8859-1 on-the-fly as it is printed, and the end-user will see the two bytes that make up the UTF-8 representation of ü when $cset is ISO-8859-1. Changing $cset to UTF-8 in this case "fixes" it, because the output stream in this case happens to be valid UTF-8 all the way through.
There are two solutions to this:
1. To use $cset = 'ISO-8859-1': Explicitly convert the UTF-8 data in $str to ISO-8859-1 yourself before sending it to print(), rather than relying on print() to do that for you. This is, in general, not possible (not all characters in the UTF-8 string may be representable in ISO-8859-1), but for HTML output we can arrange for Encode::encode to convert any non-representable charcaters to their HTML character references:
$str = Encode::encode('iso-8859-1', $str, Encode::FB_HTMLCREF);
2. To use $cset = 'UTF-8': Output UTF-8 directly, ensuring that *all* outgoing data is UTF-8 by adding an appropriate layer on STDOUT:
binmode STDOUT, ':utf8';
The second method here is generally to be preferred, but in the old software that I was experiencing problems with, I was not able to add the utf8 layer to STDOUT reliably (the data was being output from a multitude of print() statements scattered in various places), so I stuck with the first method. I believed that it should work without the explicit encoding to ISO-8859-1 because I was unaware that mod_perl's print() override removed Perl's implicit encoding behaviour. Actually, the explicit encoding above is better anyway because it also handles characters that can't be encoded to ISO-8859-1, but nevertheless I think the difference in mod_perl's print() is still worth mentioning in the documentation somewhere.