Addendum at end.
André Warnier wrote:
Hi.
I have a problem with a PerlResponseHandler, regarding the character set
used in the response to a request.
Basically, the question is : how do I set the character set properly for
the "handle" used in
$r->print("string") ?
(where string can be "äéèöü" for example)
Neither of the following (which I do before starting to print output)
seems to work :
$r->headers_out->unset('content-type');
$r->headers_out->set('content-type','text/html;charset=xxxx');
or
$r->content_type('text/html;charset=xxxx');
When I say that it doesn't work, I mean in fact :
- the "Content-Type" response header sent by the server is properly set
according to what I do above (as verified in a browser plugin)
- but if what I print contains "accented" characters, they are not being
encoded properly
So, do I need to set something else so that the $r->print(string) will
output "string" properly ?
Background :
My PerlResponseHandler reads a html file from disk, replaces some
strings in it, and sends the result out via $r->print.
The source html file can be encoded in iso-8859-1 or UTF-8, and it
contains a proper declaration of the charset under which it is really
encoded :
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
or
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
To read the file, I first open it "raw", read a few lines, checking for
the above <meta> tag. If found, I note the charset (say in $charset),
close the file, and re-open it as
open(my $fh,"<:encoding($charset)", $file);
(note : if $charset is "UTF-8", then the open becomes
open(my $fh,'<:utf8', $file);)
I also at that point set the response charset by one of the means above.
Then I read the file line by line, substituting some strings in the
line, and print out the line via
$r->print($line);
etc..
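To make the reading step concrete, here is a minimal standalone sketch of it (plain Perl, outside mod_perl). The helper name detect_charset, the 20-line limit, the regex and the iso-8859-1 fallback are all made up for this illustration and are not the actual handler code; the demo file is written on the fly so that the snippet runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: scan the first few lines of the raw file for a
# <meta ... charset=...> declaration and return the charset name, if any.
sub detect_charset {
    my ($file) = @_;
    open(my $raw, '<:raw', $file) or die "cannot open $file: $!";
    my $charset;
    while (defined(my $line = <$raw>)) {
        last if $. > 20;    # arbitrary limit: only look at the first lines
        if ($line =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([\w.:-]+)/i) {
            $charset = $1;
            last;
        }
    }
    close($raw);
    return $charset;
}

# Demo input: an iso-8859-1 file, written as raw bytes (\xe4 = a-umlaut).
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':raw');
print {$tmp_fh} qq{<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">\n};
print {$tmp_fh} "<p>M\xe4nner</p>\n";
close($tmp_fh);

# The fallback charset is an assumption made for this sketch.
my $charset = detect_charset($file) || 'iso-8859-1';

# Re-open with a decoding layer, as described above.
open(my $fh, "<:encoding($charset)", $file) or die "cannot open $file: $!";
my @lines = <$fh>;
close($fh);

# After decoding, the a-umlaut is a single character (U+00E4),
# whichever of the two encodings the file was stored in.
printf "charset=%s, second line has %d characters\n", $charset, length($lines[1]);
```

The point of the sketch is only that, after the second open, the program holds character strings and no longer bytes in the file's encoding.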
My problem is that, if the input file is for example iso-8859-1 and
contains the word "Männer", the output comes out as "M(A tilde)(some
byte)nner" (the bytes corresponding to the UTF-8 encoding of the "a
umlaut").
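To show which bytes I am talking about, here is a small standalone sketch using the core Encode module (no mod_perl involved). It only illustrates the two possible byte sequences for the same word; that $r->print sends out the internal UTF-8 form of the string is my assumption, not something I have verified. The helper name hex_bytes is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Maenner" with a-umlaut as it sits in an iso-8859-1 file:
# the a-umlaut is the single byte 0xE4.
my $file_bytes = "M\xe4nner";

# What the :encoding(iso-8859-1) layer hands to the program: a character string.
my $chars = decode('iso-8859-1', $file_bytes);

# The same characters serialized in each of the two charsets.
my $as_latin1 = encode('iso-8859-1', $chars);
my $as_utf8   = encode('UTF-8',      $chars);

# Hex dump helper (hypothetical name, for this sketch only).
sub hex_bytes { join ' ', map { sprintf '%02x', ord } split //, shift }

print "iso-8859-1: ", hex_bytes($as_latin1), "\n";   # a-umlaut is the byte e4
print "UTF-8     : ", hex_bytes($as_utf8),   "\n";   # a-umlaut is the bytes c3 a4
```

A browser that is told the page is iso-8859-1 but receives the second sequence shows the byte c3 as "A tilde" followed by one more character, which matches what I see.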
Can I / should I do something like
binmode($r,":$charset"); # ??
TIA
Addendum : I added some logging to the ResponseHandler as follows :
PARAM: while (defined($line = <$form_fh>)) {
    if ($Debug > 1) {
        $r->log->warn(" input line is [$line], utf8 flag : " . (Encode::is_utf8($line) ? "y" : "n"));
    }
The corresponding line in the log, for a line containing the word "männlich",
is :
[Thu Nov 29 10:34:37 2012] [warn] [client 192.168.245.129] input line is [\t\t\t\t<input name="ANSPR" type="radio" value="m" id="ANSPR"> m\xc3\xa4nnlich\n], utf8 flag : y
Of course, as is usual in this type of case, one never knows how the logfile itself is
written..
But it does confirm that, as read in the Handler, the string is properly encoded
internally in perl, with the utf8 flag set.
However, when I look at the result as received by the browser,
- the browser says that the page received is encoded as iso-8859-1
- the browser's "view page source" confirms that this character is (incorrectly)
represented by 2 bytes :
<input name="ANSPR" type="radio" value="m" id="ANSPR"> männlich
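To take the logfile's own escaping out of the picture, the debug line could describe the string in pure ASCII, as code points. The following is a minimal standalone sketch of that idea; the helper name describe_string is made up, and the sample string merely stands in for $line as read in the handler.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: describe a string in pure ASCII, so that the log
# file's own encoding or escaping cannot distort the picture.
sub describe_string {
    my ($s) = @_;
    my $flag = Encode::is_utf8($s) ? 'y' : 'n';
    my $codepoints = join ' ', map { sprintf 'U+%04X', ord } split //, $s;
    return "utf8 flag: $flag, length: " . length($s) . ", chars: $codepoints";
}

# Stand-in for a decoded input line; in the handler this would be $line.
my $line = Encode::decode('UTF-8', "m\xc3\xa4nn");

print describe_string($line), "\n";
```

If the a-umlaut shows up as the single code point U+00E4, the string was decoded correctly on input, and the problem has to be on the output side. In the handler, the result could be passed to $r->log->warn in place of the raw $line.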