Re: [Moses-support] web-based coding problem?

Herve Saint-Amand Wed, 06 Aug 2008 10:03:49 -0700

Hi,

Actually, I see both of these pages as UTF-8. What's more, they both declare 
that encoding. Try this at the command line. The first page declares its 
encoding in the HTTP response:


> HEAD http://www.aljazeera.net/ | grep -i ^content-type
Content-Type: text/html; charset=utf-8

and the 2nd one in the HTML file:

> curl -s http://www.moheet.com/ | grep -i content-type
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
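
The two checks above can be folded into one small helper. This is only a sketch: extract_charset is a hypothetical name, not something that ships with translate.cgi, and it assumes the declaration appears as charset=... on a single line of the headers or the HTML. The demonstration lines feed it the two declarations quoted above, so it runs without network access.

```shell
#!/bin/sh
# extract_charset: hypothetical helper, not part of translate.cgi.
# Reads HTTP headers or HTML on stdin and prints the first declared
# charset in lowercase; prints nothing if no charset is declared.
extract_charset() {
    grep -Eio 'charset="?[A-Za-z0-9._-]+' \
        | head -n 1 \
        | sed -e 's/^[^=]*=//' -e 's/"//' \
        | tr 'A-Z' 'a-z'
}

# Offline demonstration with the two declarations shown above.
printf 'Content-Type: text/html; charset=utf-8\n' | extract_charset
printf '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n' | extract_charset
```

Against a live site you would pipe in `curl -sI <url>` for the headers or `curl -s <url>` for the HTML body. If both come back empty, the page declares no encoding at all, and that is the case where some form of character set detection would be needed.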


So I would have expected the script to handle both pages properly, without 
any modifications. My version of Firefox also displays both as UTF-8, and 
although I can't read Arabic, it doesn't look to me like it's having 
encoding problems.

I'd try it on my setup, but I don't have Arabic models. Is your installation 
of the script public? If so, can I see the URL?


cheers,
Herve


--- On Wed, 6/8/08, musa ghurab <[EMAIL PROTECTED]> wrote:

> From: musa ghurab <[EMAIL PROTECTED]>
> Subject: Re: [Moses-support] web-based coding problem?
> To: [email protected]
> Date: Wednesday, 6 August, 2008, 6:39 PM
> 
> I'm trying to translate the first page of www.aljazeera.net 
> (windows-1256, but other links from the main page are utf8)
>  
> or 
>  
> any link related to the main page of www.moheet.com
> (windows-1256 but main page is utf8)
>  
> > Date: Wed, 6 Aug 2008 09:36:12 -0700
> > From: [EMAIL PROTECTED]
> > Subject: Re: [Moses-support] web-based coding problem?
> > To: [EMAIL PROTECTED]
> >
> > Hi,
> >
> > what page are you trying to translate?
> >
> > Herve
> >
> > --- On Wed, 6/8/08, musa ghurab <[EMAIL PROTECTED]> wrote:
> >
> > > From: musa ghurab <[EMAIL PROTECTED]>
> > > Subject: Re: [Moses-support] web-based coding problem?
> > > To: [email protected]
> > > Date: Wednesday, 6 August, 2008, 6:33 PM
> > >
> > > my $html = decode ('CP-1256', $res->content);
> > >
> > > this function does not help, problem still there.
> > >
> > > > Date: Wed, 6 Aug 2008 08:24:20 -0700
> > > > From: [EMAIL PROTECTED]
> > > > Subject: Re: FW: [Moses-support] web-based coding problem?
> > > > To: [EMAIL PROTECTED]
> > > > CC: [email protected]
> > > >
> > > > Hi all,
> > > >
> > > > decode_entities has nothing to do with character encodings. It
> > > > replaces HTML entities such as &nbsp; or &eacute; with the
> > > > character they stand for. Decoding is a separate process.
> > > >
> > > > The part that does the decoding is line 220:
> > > >
> > > > my $html = $res->decoded_content;
> > > >
> > > > The decoded_content method (from the HTTP::Message class) uses
> > > > the character set declared in the HTTP response or in the HTML
> > > > file itself to convert bytes to characters. If neither are
> > > > present, I think it will assume ISO-8859-1 as a default.
> > > >
> > > > I think translate.cgi as it is works with any encoding, as long
> > > > as they are declared somewhere, i.e., it does not do character
> > > > set detection.
> > > >
> > > > Now if you know that the system will always be used for pages in
> > > > a certain encoding, you could override this decoding by doing it
> > > > yourself, e.g., by replacing line 220 with this:
> > > >
> > > > my $html = decode ('CP-1256', $res->content);
> > > >
> > > > But obviously this only works if every page you serve is
> > > > CP-1256, or if you have any other means of recognizing it, which
> > > > is probably not the case.
> > > >
> > > > Otherwise you'll have to look into character set detection. As a
> > > > start you could look into the CharsetDetector package, I've
> > > > never used it myself but it looks promising:
> > > >
> > > > http://search.cpan.org/perldoc?CharsetDetector
> > > >
> > > > Good luck,
> > > > Herve
> > > >
> > > > > -----Original Message-----
> > > > > From: [EMAIL PROTECTED]
> > > > > [mailto:[EMAIL PROTECTED] On Behalf Of Philipp Koehn
> > > > > Sent: 06 August 2008 15:20
> > > > > To: musa ghurab
> > > > > Cc: [email protected]
> > > > > Subject: Re: [Moses-support] web-based coding problem?
> > > > >
> > > > > Hi,
> > > > >
> > > > > you probably have to extend the code yourself to (a) detect
> > > > > the HTML page's encoding and (b) convert it into UTF8 (which
> > > > > should be very straight-forward in Perl).
> > > > >
> > > > > -phi
> > > > >
> > > > > On Sat, Aug 2, 2008 at 5:48 PM, musa ghurab
> > > > > <[EMAIL PROTECTED]> wrote:
> > > > > > Hi all
> > > > > >
> > > > > > I'm facing problem with the moses web-based, problem related
> > > > > > to encoding.
> > > > > > In web-root file: translate.cgi line: 234
> > > > > >
> > > > > > $html=decode_entities($html);
> > > > > >
> > > > > > decode_entities(page coding: windows-1256)?wrong coding (not
> > > > > > utf8)
> > > > > >
> > > > > > This is converting the fetched text from iso coding to utf8
> > > > > > coding.
> > > > > > But what I got is when fetch page other than utf8 such as
> > > > > > Arabic (windows-1256 (cp-1256)) or any page not declaring
> > > > > > the coding in the charset of head tag of html, then it goes
> > > > > > to wrong encoding and moses cannot understand this coding.
> > > > > > i think this is bug with perl or must use another function
> > > > > > for this.
> > > > > >
> > > > > > Please any suggestion to solve this problem.
> > > > > >
> > > > > > musa ghurab

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
