Good catch, Ken. I see your point. For example, given the likely language pair (EN-AR), the text file could contain non-printing characters that the copy/paste clipboard drops.
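Until we have the raw file, one way to check for such characters is to list every code point outside printable ASCII, line by line. The following is only a rough sketch, assuming Python 3 is available and the file is UTF-8 encoded; the function names `suspicious_chars` and `scan_file` are made up for this illustration and are not part of Moses.

```python
import unicodedata


def suspicious_chars(line):
    """Return (index, code point, Unicode name) for each character
    outside the printable ASCII range (space through tilde)."""
    found = []
    for i, ch in enumerate(line.rstrip("\n")):
        if not (" " <= ch <= "~"):
            # Control characters have no Unicode name, hence the fallback.
            name = unicodedata.name(ch, "UNNAMED")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found


def scan_file(path, encoding="utf-8"):
    """Print every suspicious character in a file with its position.
    The UTF-8 default is an assumption about the input file."""
    with open(path, encoding=encoding) as fh:
        for lineno, line in enumerate(fh, 1):
            for idx, codepoint, name in suspicious_chars(line):
                print(f"{path}:{lineno}:{idx}: {codepoint} {name}")
```

Calling `scan_file("test.txt")` on the original training file, rather than on a pasted copy, should show whether a typographic apostrophe, a directional mark, or some other invisible character sits near "printer's". Plain ASCII lines produce no output.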

On 01/15/2015 08:39 AM, Kenneth Heafield wrote:
> I'll inject that it is plausible there is some weird Unicode going on
> there and copy-paste on Linux sometimes canonicalized graphemes. Whilst
> I'm inclined to side with Tom, the only way to sort this out is with the
> raw file from Ihab as e.g. a gzipped attachment.
>
> Kenneth
>
> On 01/14/2015 08:33 PM, Tom Hoar wrote:
>> I just ran the same sentence through the newest github clone (today).
>>
>> corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$
>> ./tokenizer.perl -no-escape -q -l en < test.txt
>> which will guide you through connecting and configuring your printer 's
>> wireless connection .
>> which will guide you through connecting and configuring your printer 's
>> wireless connection .
>> which will guide you through connecting and configuring your printer 's
>> wireless connection .
>> which will guide you through connecting and configuring your printer 's
>> wireless connection .
>> which will guide you through connecting and configuring your printer 's
>> wireless connection .
>>
>> This is not a Perl script problem. What shell and command line are you
>> using for your "in the file" results? You'll find the problem in either
>> your shell or your custom tool chain(s) before you run tokenizer.perl.
>>
>> On 01/14/2015 04:13 PM, Ihab Ramadan wrote:
>>> Dears,
>>>
>>> I still have this problem, for not confusing the decoder I used the
>>> “–no-escape” parameter in the tokenizer.perl script but still have the
>>> problem of adding extra space after quotations for tokenizing files
>>> however in tokenizing a segment it comes without the extra space
>>>
>>> For example
>>>
>>> In the file
>>>
>>> “which will guide you through connecting and configuring your
>>> printer's wireless connection. “ -> “which will guide you through
>>> connecting and configuring your printer ' s wireless connection .”
>>>
>>> As a segment
>>>
>>> “which will guide you through connecting and configuring your
>>> printer's wireless connection. “ -> “which will guide you through
>>> connecting and configuring your printer 's wireless connection .”
>>>
>>> I wonder if it is the same script why it generated two different outputs
>>>
>>> I have no experience in perl so I could not get the line of code which
>>> differ between if the segment in a file or just one segment passed as
>>> a parameter to the script
>>>
>>> Please help
>>>
>>> *From:* Ihab Ramadan [mailto:[email protected]]
>>> *Sent:* Monday, January 5, 2015 10:09 AM
>>> *To:* [email protected]
>>> *Subject:* Tokenization problem
>>>
>>> Dears,
>>>
>>> Using the tokenizer on the training files replaces the apostrophes
>>> with “' s” (with space) but if I use the same script to tokenize
>>> a sentence it makes the apostrophes to be “'s” (without a space)
>>>
>>> This problem confuse the decoder while translation
>>>
>>> How to solve this peoblem
>>>
>>> Thanks
>>>
>>> Best Regards
>>>
>>> /Ihab Ramadan/ | Senior Developer | Saudisoft <http://www.saudisoft.com/>
>>> - Egypt | *Tel* +2 02 330 320 37 Ext- 0 | Mob +201007570826 |
>>> Fax +20233032036
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
