Many thanks to all of you.

As you suggested, the problem was not in the script. It was in the text sent to the terminal from my web app: I found that some characters are not passed through unchanged and arrive as unexpected Unicode characters instead.

Thanks, everybody.
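For anyone who runs into the same symptom, here is a minimal Python sketch of how the text could be checked before it reaches tokenizer.perl. It is only an illustration: the function names and the replacement table are hypothetical, and the thread does not say which characters were actually affected, so U+2019 (the typographic apostrophe) is used purely as an assumed example.

```python
import unicodedata

# Hypothetical table for illustration only: typographic characters that web
# forms and word processors often substitute for plain ASCII ones.
ASCII_EQUIVALENTS = {
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u00a0": " ",   # NO-BREAK SPACE
}


def find_non_ascii(text):
    """Return (index, char, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


def to_ascii_quotes(text):
    """Replace the characters listed in ASCII_EQUIVALENTS; leave everything else alone."""
    return text.translate(str.maketrans(ASCII_EQUIVALENTS))


if __name__ == "__main__":
    # Assumed example input: the sentence from this thread with a curly apostrophe.
    sample = "your printer\u2019s wireless connection."
    for hit in find_non_ascii(sample):
        print(hit)
    print(to_ascii_quotes(sample))
```

Running find_non_ascii on both the training file and the text coming from the web app should show whether the two inputs really contain the same characters. Whether to normalize them, as to_ascii_quotes does, or to fix the sender is a separate decision.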
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
Sent: Thursday, January 15, 2015 3:39 AM
To: [email protected]
Subject: Moses-support Digest, Vol 99, Issue 28

Send Moses-support mailing list submissions to
    [email protected]

To subscribe or unsubscribe via the World Wide Web, visit
    http://mailman.mit.edu/mailman/listinfo/moses-support
or, via email, send a message with subject or body 'help' to
    [email protected]

You can reach the person managing the list at
    [email protected]

When replying, please edit your Subject line so it is more specific than "Re: Contents of Moses-support digest..."

Today's Topics:

   1. how to align some new parallel sentences using a trained model (iamzcy_hit iamzcy_hit)
   2. Re: Tokenization problem (Tom Hoar)
   3. Re: Tokenization problem (Kenneth Heafield)

----------------------------------------------------------------------

Message: 1
Date: Thu, 15 Jan 2015 08:54:06 +0800
From: iamzcy_hit iamzcy_hit <[email protected]>
Subject: [Moses-support] how to align some new parallel sentences using a trained model
To: "[email protected]" <[email protected]>
Message-ID: <CAGLowvLWHXb_J+=vZqMeOVCOD7Z=Uzyz_Sn=yjv+ptsfsyv...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi all,

If I have trained an alignment model on a huge parallel corpus with GIZA++, MGIZA, or fast_align, and I am now given some new sentence pairs and want to align the words in them, how should I do it?

Best regards

------------------------------

Message: 2
Date: Thu, 15 Jan 2015 08:33:17 +0700
From: Tom Hoar <[email protected]>
Subject: Re: [Moses-support] Tokenization problem
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="windows-1252"

I just ran the same sentence through the newest github clone (today).

corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$ ./tokenizer.perl -no-escape -q -l en < test.txt
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .

This is not a Perl script problem. What shell and command line are you using for your "in the file" results? You'll find the problem in either your shell or your custom tool chain(s) before you run tokenizer.perl.

On 01/14/2015 04:13 PM, Ihab Ramadan wrote:
> Dears,
>
> I still have this problem. To avoid confusing the decoder I used the
> "-no-escape" parameter of the tokenizer.perl script, but I still get
> an extra space after quotation marks when tokenizing files, whereas
> when tokenizing a single segment the output comes without the extra
> space.
>
> For example
>
> In the file
>
> "which will guide you through connecting and configuring your printer's wireless connection."
> "which will guide you through connecting and configuring your printer ' s wireless connection ."
>
> As a segment
>
> "which will guide you through connecting and configuring your printer's wireless connection."
> "which will guide you through connecting and configuring your printer 's wireless connection ."
>
> If it is the same script, I wonder why it generates two different
> outputs.
>
> I have no experience in Perl, so I could not find the line of code
> that behaves differently depending on whether the segment is in a file
> or is passed as a single segment to the script.
>
> Please help
>
> From: Ihab Ramadan [mailto:[email protected]]
> Sent: Monday, January 5, 2015 10:09 AM
> To: [email protected]
> Subject: Tokenization problem
>
> Dears,
>
> Using the tokenizer on the training files replaces the apostrophes
> with "' s" (with a space), but if I use the same script to tokenize
> a sentence it turns the apostrophes into "'s" (without a space).
>
> This problem confuses the decoder during translation.
>
> How can I solve this problem?
>
> Thanks
>
> Best Regards
>
> Ihab Ramadan | Senior Developer | Saudisoft <http://www.saudisoft.com/> - Egypt
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

------------------------------

Message: 3
Date: Wed, 14 Jan 2015 20:39:14 -0500
From: Kenneth Heafield <[email protected]>
Subject: Re: [Moses-support] Tokenization problem
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=windows-1252

I'll inject that it is plausible there is some weird Unicode going on there, and copy-paste on Linux sometimes canonicalizes graphemes. Whilst I'm inclined to side with Tom, the only way to sort this out is with the raw file from Ihab as e.g. a gzipped attachment.

Kenneth

------------------------------

End of Moses-support Digest, Vol 99, Issue 28
*********************************************
