> From: musa ghurab <[EMAIL PROTECTED]>
> 
> [...]
> 
> my $html = decode('UTF-8', $res->content);


In this case you are ignoring whatever encoding the server declares and 
assuming UTF-8 for every page.

Since this worked, the problem lies in how the LWP library selects the 
encoding. I did some debugging, and it turns out that Perl's Web routines 
favour the encoding declared in the HTML file over the one declared in the 
HTTP header. In the case of aljazeera.net this selects the wrong encoding, 
because the site correctly declares UTF-8 in the HTTP header but wrongly 
declares CP-1256 in the HTML document.
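
To make the conflict concrete, here is a minimal sketch that extracts both 
declarations and compares them. The two helpers (charset_from_header and 
charset_from_meta) are hypothetical names of mine, not part of LWP, the 
regexes are only a rough stand-in for a real HTML parser, and the sample 
header and markup are made up to mimic the situation described above 
(windows-1256 is the usual label for CP-1256).

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset parameter out of a Content-Type value.
sub charset_from_header {
    my ($content_type) = @_;
    return undef unless defined $content_type;
    return $content_type =~ /charset\s*=\s*"?([\w.:-]+)/i ? lc($1) : undef;
}

# Hypothetical helper: find the charset declared in a <meta> tag.
# A regex is only an approximation of what an HTML parser would do.
sub charset_from_meta {
    my ($html) = @_;
    return undef unless defined $html;
    return $html =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i ? lc($1) : undef;
}

# Made-up sample data imitating a site with contradictory declarations.
my $header = 'text/html; charset=UTF-8';
my $html   = '<html><head><meta http-equiv="Content-Type" '
           . 'content="text/html; charset=windows-1256"></head></html>';

my $from_header = charset_from_header($header);
my $from_meta   = charset_from_meta($html);

print "HTTP header says: $from_header\n";
print "HTML meta says:   $from_meta\n";
print "The two declarations contradict each other.\n"
    if $from_header ne $from_meta;
```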


At its core, the problem is that aljazeera.net declares two contradictory 
encodings. The best workaround is to do what browsers do: treat these 
declarations only as hints, and detect the character encoding ourselves.
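
As a rough illustration of "declarations as hints", the sketch below tries a 
strict UTF-8 decode first and only then falls back to the declared charsets. 
This is a simplification of my own, not what any particular browser 
implements; decode_with_hints is a hypothetical name, and the final Latin-1 
fallback is an arbitrary choice. It uses only the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode $bytes, treating declared charsets as hints only.
# Strict UTF-8 is tried first: text in another encoding is rarely valid UTF-8
# by accident, whereas single-byte encodings accept almost any input.
sub decode_with_hints {
    my ($bytes, @hints) = @_;
    for my $name ('UTF-8', grep { defined } @hints) {
        # FB_CROAK makes decode() die on malformed input (or unknown names);
        # LEAVE_SRC keeps $bytes intact for the next attempt.
        my $text = eval { decode($name, $bytes, FB_CROAK() | LEAVE_SRC()) };
        return ($text, $name) if defined $text;
    }
    # Last resort (an assumption): Latin-1 maps every byte to a character.
    return (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}

# Sample text for illustration: an Arabic word, sent as UTF-8 bytes by a
# page that nevertheless claims to be cp1256.
my $arabic     = "\x{0633}\x{0644}\x{0627}\x{0645}";
my $utf8_bytes = encode('UTF-8', $arabic);

my ($text, $used) = decode_with_hints($utf8_bytes, 'cp1256');
print "decoded as $used\n";
```

The hint is still honoured when the bytes are not valid UTF-8, so a page that 
really is CP-1256 and says so would be decoded with cp1256.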


Other hacks will work for this particular site but break others. One is to do 
as you did: skip detection altogether and always assume UTF-8:

    my $html = decode ('UTF-8', $res->content);
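
To see why this breaks on other sites, here is a small sketch of what happens 
when a page that really is CP-1256 gets decoded as UTF-8. The sample word is 
my own choice for illustration; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample text chosen for illustration: an Arabic word.
my $original = "\x{0633}\x{0644}\x{0627}\x{0645}";

# Simulate a site that genuinely serves CP-1256 bytes.
my $cp1256_bytes = encode('cp1256', $original);

# Blindly assuming UTF-8: malformed sequences are replaced with U+FFFD.
my $forced = decode('UTF-8', $cp1256_bytes);

# Decoding with the encoding the site actually uses recovers the text.
my $correct = decode('cp1256', $cp1256_bytes);

printf "forced UTF-8 contains U+FFFD: %s\n",
    ($forced =~ /\x{FFFD}/ ? 'yes' : 'no');
printf "cp1256 round-trips:           %s\n",
    ($correct eq $original ? 'yes' : 'no');
```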

Another, similarly inadequate solution would be to always ignore the encoding 
declared in the HTML file by inserting the parse_head option around line 214:

    my $lwp = new LWP::UserAgent (%{{
        agent      => $ENV{HTTP_USER_AGENT} || 'Mozilla/5.0',
        timeout    => 5,
        parse_head => 0,    # do not initialize response headers from the HTML <head>
    }});

This solves the problem for this site, but it breaks detection on pages that 
declare their encoding only in the HTML page (such as moheet.com).

Finally, you could tweak the LWP code to reverse the order of precedence of 
encoding declarations. But I'm sure that would then break detection on other 
sites.


So the bottom line, I'd say, is that:

1. this is not a bug in translate.cgi or in LWP, it's a problem with 
aljazeera.net

2. to work around this problem in a way that doesn't break detection on other 
sites, I think you'd need to do charset detection yourself, ignoring whatever 
is declared and analyzing the received bytes. This typically involves 
statistical language models of the byte sequences.


Herve

