> From: musa ghurab <[EMAIL PROTECTED]>
>
> [...]
>
> my $html = decode('UTF-8', $res->content);
In this case you are ignoring whatever encoding the server declares and
assuming UTF-8 for every page.
Since this worked, the problem lies in how the LWP library selects the
encoding. I did some debugging, and it turns out that Perl's Web routines
favour the encoding declared in the HTML file over the one declared in the
HTTP header. In the case of aljazeera.net, that means the wrong encoding is
selected: the site correctly declares UTF-8 in the HTTP header but wrongly
declares CP-1256 in the HTML document.
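To make the conflict concrete, here is a rough sketch of how one could compare the two declarations by hand. The helper names charset_of and meta_charset are made up for illustration (they are not part of LWP or translate.cgi), and the header and HTML strings are hard-coded stand-ins for what such a site might send:

```perl
use strict;
use warnings;

# Hypothetical helper: extract the charset parameter from a Content-Type
# value (or from the attributes of a <meta> tag). Returns it lowercased.
sub charset_of {
    my ($value) = @_;
    if (defined $value && $value =~ /charset\s*=\s*["']?([\w.:-]+)/i) {
        return lc $1;
    }
    return;
}

# Hypothetical helper: find the first charset declared in a <meta> tag,
# looking only at the start of the raw (undecoded) document.
sub meta_charset {
    my ($html_bytes) = @_;
    my $head = substr($html_bytes, 0, 4096);
    while ($head =~ /<meta\b([^>]*)>/gi) {
        my $cs = charset_of($1);
        return $cs if defined $cs;
    }
    return;
}

# Illustrative inputs only, not captured from a real response.
my $header_cs = charset_of('text/html; charset=UTF-8');
my $meta_cs   = meta_charset(
    '<html><head><meta http-equiv="Content-Type" '
  . 'content="text/html; charset=windows-1256"></head>'
);

if (defined $header_cs && defined $meta_cs && $header_cs ne $meta_cs) {
    print "conflict: header says $header_cs, HTML says $meta_cs\n";
}
```

This only reports the disagreement; it does not say which of the two declarations is right.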
At its core, the problem is that aljazeera.net declares two contradictory
encodings. The best workaround is to do what browsers do: treat these
declarations only as hints, and detect the character encoding ourselves.
Other hacks would work for this particular site but break others. One is to
do as you did: skip detection altogether and always assume UTF-8:
    my $html = decode('UTF-8', $res->content);
Another, similarly inadequate solution would be to always ignore the encoding
declared in the HTML file, by adding the parse_head option around line 214:
    my $lwp = new LWP::UserAgent (%{{
        agent => $ENV{HTTP_USER_AGENT} || 'Mozilla/5.0',
        timeout => 5,
        parse_head => 0,
    }});
This will solve the problem for this site, but it will break detection on
pages that declare their encoding only in the HTML page (such as moheet.com).
Finally, you could tweak the LWP code to reverse the order of precedence of
the encoding declarations, but I'm sure that would then break detection on
other sites.
So the bottom line, I'd say, is that:
1. this is not a bug in translate.cgi or in LWP; it's a problem with
aljazeera.net
2. to work around this problem in a way that doesn't break detection on other
sites, I think you'd need to do charset detection yourself, ignoring whatever
is declared and analyzing the received bytes. This typically involves language
models and the like.
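As a much cruder stand-in for real detection, here is a minimal sketch of byte analysis using only the core Encode module: try a strict UTF-8 decode and, if the bytes are not valid UTF-8, fall back to a legacy encoding. The helper name decode_guess and the cp1256 default are my own assumptions for illustration; this is not language-model-based detection and will not tell legacy encodings apart from one another.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns (decoded text, name of the encoding used).
# Assumption: anything that is not valid UTF-8 is in $fallback (cp1256 here).
sub decode_guess {
    my ($bytes, $fallback) = @_;
    $fallback ||= 'cp1256';

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # $bytes intact so it can still be decoded with the fallback.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return ($text, 'UTF-8') if defined $text;

    return (decode($fallback, $bytes), $fallback);
}

# The same five-letter Arabic word, once as UTF-8 bytes and once as cp1256.
my ($from_utf8,   $enc1) = decode_guess("\xD9\x85\xD8\xB1\xD8\xAD\xD8\xA8\xD8\xA7");
my ($from_cp1256, $enc2) = decode_guess("\xE3\xD1\xCD\xC8\xC7");

print "first input decoded as $enc1, second as $enc2\n";
print "same text\n" if $from_utf8 eq $from_cp1256;
```

In translate.cgi one would presumably pass $res->content instead of the hard-coded strings. The check leans on the fact that text in a legacy 8-bit encoding very rarely happens to be valid UTF-8, so a successful strict decode is a strong hint.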
Herve
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support