Re: [Moses-support] FW: web-based coding problem?

Herve Saint-Amand Fri, 08 Aug 2008 03:12:26 -0700

Hi all,

decode_entities has nothing to do with character encodings. It replaces HTML 
entities such as &nbsp; or &eacute; with the characters they stand for. 
Converting bytes to characters (decoding) is a separate step.
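
A minimal sketch of the two steps, assuming the Encode and HTML::Entities 
modules are installed (translate.cgi already calls decode_entities). The 
sample bytes are made up for illustration and are not taken from any real page:

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(decode_entities);

# Illustrative raw bytes as a windows-1256 page might send them:
# byte 0xC7 (ARABIC LETTER ALEF in CP-1256) followed by an HTML entity.
my $bytes = "\xC7&eacute;";

# Step 1: character decoding turns bytes into Perl characters.
my $chars = decode('cp1256', $bytes);

# Step 2: entity decoding replaces &eacute; with U+00E9.
# It works on characters and knows nothing about the byte encoding.
my $text = decode_entities($chars);

# Print the code points of the result: U+0627 U+00E9
print join(' ', map { sprintf 'U+%04X', ord } split //, $text), "\n";
```

If step 1 uses the wrong encoding, step 2 still runs without complaint, which 
is why the problem can look as if it came from decode_entities.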



The part that does the decoding is line 220:

    my $html = $res->decoded_content;

The decoded_content method (from the HTTP::Message class) uses the character 
set declared in the HTTP response or in the HTML file itself to convert bytes 
to characters. If neither is present, I think it will assume ISO-8859-1 as a 
default.
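
A small sketch of that behaviour, assuming HTTP::Response (from the same 
distribution as HTTP::Message) is installed. The responses are built by hand 
here instead of being fetched, and the default_charset option shown in the 
second case is documented for decoded_content but may depend on the installed 
version:

```perl
use strict;
use warnings;
use HTTP::Response;

# One illustrative byte: 0xC7 is ARABIC LETTER ALEF (U+0627) in windows-1256.
my $body = "\xC7";

# Case 1: the charset is declared in the Content-Type header,
# so decoded_content can convert the bytes correctly.
my $declared = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=windows-1256' ],
    $body,
);
my $html_declared = $declared->decoded_content;

# Case 2: nothing is declared. The default_charset option replaces
# the built-in fallback with a charset of your choosing.
my $undeclared = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html' ],
    $body,
);
my $html_default = $undeclared->decoded_content(default_charset => 'windows-1256');

printf "declared: U+%04X\n", ord $html_declared;   # declared: U+0627
printf "default:  U+%04X\n", ord $html_default;    # default:  U+0627
```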

I think translate.cgi as it stands works with any encoding, as long as the 
encoding is declared somewhere. What it does not do is character set detection.


Now if you know that the system will always be used for pages in a certain 
encoding, you could override this decoding by doing it yourself, e.g., by 
replacing line 220 with this (decode is exported by the Encode module, which 
the script must load):

    my $html = decode ('CP-1256', $res->content);

But obviously this only works if every page you serve is CP-1256, or if you 
have some other means of recognizing which pages are, which is probably not 
the case.
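
One cheap heuristic, sketched here as an assumption and not as anything 
translate.cgi does: try a strict UTF-8 decode first and fall back to a fixed 
encoding only if that fails. Non-ASCII CP-1256 text is rarely valid UTF-8 by 
accident, but this is still a guess, not real detection. The helper name 
decode_utf8_or_fallback is hypothetical, and only the core Encode module is 
used:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not part of translate.cgi): decode $bytes as strict
# UTF-8 if possible, otherwise as the given fallback encoding.
sub decode_utf8_or_fallback {
    my ($bytes, $fallback) = @_;

    # FB_CROAK makes decode die on malformed UTF-8;
    # LEAVE_SRC keeps $bytes intact so it can be decoded again below.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    return decode($fallback, $bytes);
}

# Two alefs in CP-1256: not valid UTF-8, so the fallback is used.
my $from_cp1256 = decode_utf8_or_fallback("\xC7\xC7", 'cp1256');

# One alef in UTF-8: decodes directly.
my $from_utf8 = decode_utf8_or_fallback("\xD8\xA7", 'cp1256');

printf "cp1256 input: %d character(s)\n", length $from_cp1256;   # 2
printf "utf8 input:   %d character(s)\n", length $from_utf8;     # 1
```

In the script this could stand in for line 220, along the lines of 
my $html = decode_utf8_or_fallback($res->content, 'cp1256');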


Otherwise you'll have to look into character set detection. As a start you 
could look into the CharsetDetector package. I've never used it myself, but it 
looks promising:

    http://search.cpan.org/perldoc?CharsetDetector


Good luck,
Herve


> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]
> On Behalf Of Philipp Koehn
> Sent: 06 August 2008 15:20
> To: musa ghurab
> Cc: [email protected]
> Subject: Re: [Moses-support] web-based coding problem?
> 
> Hi,
> 
> you probably have to extend the code yourself to (a) detect the HTML page's
> encoding and (b) convert it into UTF8 (which should be very straight-forward
> in Perl).
> 
> -phi
> 
> On Sat, Aug 2, 2008 at 5:48 PM, musa ghurab
> <[EMAIL PROTECTED]> wrote:
> > Hi all
> >
> > I'm facing problem with the moses web-based, problem related to encoding.
> > In web-root file: translate.cgi line: 234
> >
> > $html=decode_entities($html);
> >
> > decode_entities(page coding: windows-1256)?wrong coding (not utf8)
> >
> > This is converting the fetched text from iso coding to utf8 coding.
> > But what I got is when fetch page other than utf8 such as Arabic
> > (windows-1256 (cp-1256)) or any page not declaring the coding in the
> > charset of head tag of html, then it goes to wrong encoding and moses
> > cannot understand this coding.
> > i think this is bug with perl or must use another function for this.
> >
> > Please any suggestion to solve this problem.
> >
> >
> >
> > musa ghurab


_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
