I'll interject that it is plausible there is some odd Unicode in that file, and copy-paste on Linux sometimes canonicalizes graphemes. Whilst I'm inclined to side with Tom, the only way to sort this out is with the raw file from Ihab, e.g. as a gzipped attachment.
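Until the raw file turns up, a quick way to check for odd Unicode is to list every non-ASCII character the file contains. What follows is only a minimal sketch, not part of Moses: the function name non_ascii_report and the script name in the usage comment are made up for illustration, and it assumes the file is UTF-8.

```python
import sys
import unicodedata
from collections import Counter


def non_ascii_report(text):
    """Return (code point, Unicode name, count) for each non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in sorted(counts.items())
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 non_ascii_report.py test.txt
    # Assumes the input is UTF-8; a decode error here is itself a useful clue.
    with open(sys.argv[1], encoding="utf-8") as handle:
        for code_point, name, n in non_ascii_report(handle.read()):
            print(code_point, name, n)
```

If the training file reports something like a typographic apostrophe or a no-break space where the single pasted segment has plain ASCII, that would be consistent with the two inputs not actually being the same text.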
Kenneth

On 01/14/2015 08:33 PM, Tom Hoar wrote:
> I just ran the same sentence through the newest github clone (today).
>
> corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$
> ./tokenizer.perl -no-escape -q -l en < test.txt
> which will guide you through connecting and configuring your printer 's
> wireless connection .
> which will guide you through connecting and configuring your printer 's
> wireless connection .
> which will guide you through connecting and configuring your printer 's
> wireless connection .
> which will guide you through connecting and configuring your printer 's
> wireless connection .
> which will guide you through connecting and configuring your printer 's
> wireless connection .
>
> This is not a Perl script problem. What shell and command line are you
> using for your "in the file" results? You'll find the problem in either
> your shell or your custom tool chain(s) before you run tokenizer.perl.
>
> On 01/14/2015 04:13 PM, Ihab Ramadan wrote:
>>
>> Dears,
>>
>> I still have this problem, for not confusing the decoder I used the
>> “–no-escape” parameter in the tokenizer.perl script but still have the
>> problem of adding extra space after quotations for tokenizing files
>> however in tokenizing a segment it comes without the extra space
>>
>> For example
>>
>> In the file
>>
>> “which will guide you through connecting and configuring your
>> printer's wireless connection. “ -> “which will guide you through
>> connecting and configuring your printer ' s wireless connection .”
>>
>> As a segment
>>
>> “which will guide you through connecting and configuring your
>> printer's wireless connection. “ -> “which will guide you through
>> connecting and configuring your printer 's wireless connection .”
>>
>> I wonder if it is the same script why it generated two different outputs
>>
>> I have no experience in perl so I could not get the line of code which
>> differ between if the segment in a file or just one segment passed as
>> a parameter to the script
>>
>> Please help
>>
>> *From:*Ihab Ramadan [mailto:[email protected]]
>> *Sent:* Monday, January 5, 2015 10:09 AM
>> *To:* [email protected]
>> *Subject:* Tokenization problem
>>
>> Dears,
>>
>> Using the tokenizer on the training files replaces the apostrophes
>> with “' s” (with space) but if I use the same script to tokenize
>> a sentence it makes the apostrophes to be “'s” (without a space)
>>
>> This problem confuse the decoder while translation
>>
>> How to solve this peoblem
>>
>> Thanks
>>
>> Best Regards
>>
>> Ihab Ramadan | Senior Developer | Saudisoft <http://www.saudisoft.com/>
>> - Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 |
>> Fax +20233032036
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
