Hi,

I see both these pages as UTF-8 actually. What's more, they both
actually declare being encoded as such. Try this at the command line:
the 1st page declares its encoding in the HTTP response:

    > HEAD http://www.aljazeera.net/ | grep -i ^content-type
    Content-Type: text/html; charset=utf-8

and the 2nd one in the HTML file:

    > curl -s http://www.moheet.com/ | grep -i content-type
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

So actually I would have expected the script to handle them both
properly, without any modifications. My version of Firefox also
displays both as UTF-8, and although I can't read Arabic it doesn't
look to me like it's having encoding problems.

I'd try it on my setup but I don't have Arabic models so I can't. Is
your installation of the script public, can I see the URL?

cheers,
Herve

--- On Wed, 6/8/08, musa ghurab <[EMAIL PROTECTED]> wrote:

> From: musa ghurab <[EMAIL PROTECTED]>
> Subject: Re: [Moses-support] web-based coding problem?
> To: [email protected]
> Date: Wednesday, 6 August, 2008, 6:39 PM
>
> i'm trying to translate first page of www.aljazeera.net
> (windows-1256 but other links from main page are utf8)
>
> or
>
> any link related to the main page of www.moheet.com
> (windows-1256 but main page is utf8)
>
> > Date: Wed, 6 Aug 2008 09:36:12 -0700
> > From: [EMAIL PROTECTED]
> > Subject: Re: [Moses-support] web-based coding problem?
> > To: [EMAIL PROTECTED]
> >
> > Hi,
> >
> > what page are you trying to translate?
> >
> > Herve
> >
> > --- On Wed, 6/8/08, musa ghurab <[EMAIL PROTECTED]> wrote:
> >
> > > From: musa ghurab <[EMAIL PROTECTED]>
> > > Subject: Re: [Moses-support] web-based coding problem?
> > > To: [email protected]
> > > Date: Wednesday, 6 August, 2008, 6:33 PM
> > >
> > > my $html = decode('CP-1256', $res->content);
> > >
> > > this function does not help, problem still there.
> > >
> > > > Date: Wed, 6 Aug 2008 08:24:20 -0700
> > > > From: [EMAIL PROTECTED]
> > > > Subject: Re: FW: [Moses-support] web-based coding problem?
> > > > To: [EMAIL PROTECTED]
> > > > CC: [email protected]
> > > >
> > > > Hi all,
> > > >
> > > > decode_entities has nothing to do with character encodings.
> > > > It replaces HTML entities such as &eacute; (é) with the
> > > > character they stand for. Decoding is a separate process.
> > > >
> > > > The part that does the decoding is line 220:
> > > >
> > > > my $html = $res->decoded_content;
> > > >
> > > > The decoded_content method (from the HTTP::Message class) uses
> > > > the character set declared in the HTTP response or in the HTML
> > > > file itself to convert bytes to characters. If neither is
> > > > present, I think it will assume ISO-8859-1 as a default.
> > > >
> > > > I think translate.cgi as it is works with any encoding, as
> > > > long as it is declared somewhere, i.e., it does not do
> > > > character set detection.
> > > >
> > > > Now if you know that the system will always be used for pages
> > > > in a certain encoding, you could override this decoding by
> > > > doing it yourself, e.g., by replacing line 220 with this:
> > > >
> > > > my $html = decode('CP-1256', $res->content);
> > > >
> > > > But obviously this only works if every page you serve is
> > > > CP-1256, or if you have any other means of recognizing it,
> > > > which is probably not the case.
> > > >
> > > > Otherwise you'll have to look into character set detection.
> > > > As a start you could look into the CharsetDetector package,
> > > > I've never used it myself but it looks promising:
> > > >
> > > > http://search.cpan.org/perldoc?CharsetDetector
> > > >
> > > > Good luck,
> > > > Herve
> > > >
> > > > > -----Original Message-----
> > > > > From: [EMAIL PROTECTED]
> > > > > [mailto:[EMAIL PROTECTED] On Behalf Of Philipp Koehn
> > > > > Sent: 06 August 2008 15:20
> > > > > To: musa ghurab
> > > > > Cc: [email protected]
> > > > > Subject: Re: [Moses-support] web-based coding problem?
> > > > >
> > > > > Hi,
> > > > >
> > > > > you probably have to extend the code yourself to (a) detect
> > > > > the HTML page's encoding and (b) convert it into UTF8
> > > > > (which should be very straight-forward in Perl).
> > > > >
> > > > > -phi
> > > > >
> > > > > On Sat, Aug 2, 2008 at 5:48 PM, musa ghurab
> > > > > <[EMAIL PROTECTED]> wrote:
> > > > > > Hi all
> > > > > >
> > > > > > I'm facing problem with the moses web-based, problem
> > > > > > related to encoding.
> > > > > > In web-root file: translate.cgi line: 234
> > > > > >
> > > > > > $html=decode_entities($html);
> > > > > >
> > > > > > decode_entities(page coding: windows-1256)?wrong coding
> > > > > > (not utf8)
> > > > > >
> > > > > > This is converting the fetched text from iso coding to
> > > > > > utf8 coding.
> > > > > > But what I got is when fetch page other than utf8 such
> > > > > > as Arabic (windows-1256 (cp-1256)) or any page not
> > > > > > declaring the coding in the charset of head tag of html,
> > > > > > then it goes to wrong encoding and moses cannot
> > > > > > understand this coding.
> > > > > > i think this is bug with perl or must use another
> > > > > > function for this.
> > > > > >
> > > > > > Please any suggestion to solve this problem.
> > > > > >
> > > > > > musa ghurab

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
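
The two command-line checks at the top of this message (the HTTP header for the first site, the meta tag for the second) could be wrapped in one small shell helper. This is only a rough sketch: the function name declared_charset is made up for illustration, it only looks at lines mentioning Content-Type, and it will miss quoted values or the shorter `<meta charset=...>` form.

```shell
#!/bin/sh
# Hypothetical helper (not part of translate.cgi): read HTTP headers or
# HTML on stdin and print the first charset declared on a Content-Type
# line, lower-cased. Prints nothing when no charset is declared.
declared_charset() {
    awk '
        !found && tolower($0) ~ /content-type/ {
            line = tolower($0)
            # "charset=" is 8 characters long; keep only the value after it
            if (match(line, /charset=[a-z0-9_-]+/)) {
                print substr(line, RSTART + 8, RLENGTH - 8)
                found = 1
            }
        }
    '
}

# Usage, mirroring the commands in the message (needs network access):
#   HEAD http://www.aljazeera.net/ | declared_charset
#   curl -s http://www.moheet.com/ | declared_charset
```

An empty result would mean the page declares nothing in the place you looked, which is the case where decoded_content has to fall back to a default and where real character set detection would be needed.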
