Re: [Moses-support] Tokenization problem

Kenneth Heafield Wed, 14 Jan 2015 17:41:29 -0800

I'll interject that it is plausible there is some weird Unicode going on
there, and copy-paste on Linux sometimes canonicalizes graphemes.  While
I'm inclined to side with Tom, the only way to sort this out is with the
raw file from Ihab, e.g. as a gzipped attachment.
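
One way to check for such characters, as a minimal sketch in Python: the
function name and the two sample strings below are hypothetical, and
nothing here is part of Moses. It lists every non-ASCII character in a
string, with its offset and Unicode name, so a typographic apostrophe
(U+2019) can be told apart from the ASCII one (U+0027).

```python
import unicodedata


def report_non_ascii(text):
    """Return (offset, code point, Unicode name) for each non-ASCII character."""
    findings = []
    for offset, ch in enumerate(text):
        if ord(ch) > 127:
            name = unicodedata.name(ch, "UNKNOWN")
            findings.append((offset, f"U+{ord(ch):04X}", name))
    return findings


# Hypothetical samples: the same phrase with an ASCII apostrophe and
# with a typographic apostrophe (U+2019). These are assumptions for
# illustration, not the contents of the actual file.
ascii_line = "your printer's wireless connection."
curly_line = "your printer\u2019s wireless connection."

print(report_non_ascii(ascii_line))
print(report_non_ascii(curly_line))
```

To inspect a real file, the same function could be applied to each line
read with open(path, encoding="utf-8"). An empty result for every line
would suggest the difference comes from somewhere other than the input
characters.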


Kenneth

On 01/14/2015 08:33 PM, Tom Hoar wrote:
> I just ran the same sentence through the newest github clone (today).
> 
> corporamgr@domt-v2:~/Public/src/mosesdecoder/scripts/tokenizer$
> ./tokenizer.perl -no-escape -q -l en < test.txt
> which will guide you through connecting and configuring your printer 's
> wireless connection .
> which will guide you through connecting and configuring your printer 's
> wireless connection .
> which will guide you through connecting and configuring your printer 's
> wireless connection .
> which will guide you through connecting and configuring your printer 's
> wireless connection .
> which will guide you through connecting and configuring your printer 's
> wireless connection .
> 
> This is not a Perl script problem. What shell and command line are you
> using for your "in the file" results? You'll find the problem in either
> your shell or your custom tool chain(s) before you run tokenizer.perl.
> 
> 
> 
> On 01/14/2015 04:13 PM, Ihab Ramadan wrote:
>>
>> Dears,
>>
>> I still have this problem. To avoid confusing the decoder, I used the
>> “-no-escape” parameter of the tokenizer.perl script, but an extra space
>> is still added after apostrophes when tokenizing files, whereas
>> tokenizing a single segment produces no extra space.
>>
>> For example
>>
>> In the file
>>
>> “which will guide you through connecting and configuring your
>> printer's wireless connection. “ → “which will guide you through
>> connecting and configuring your printer ' s wireless connection .”
>>
>> As a segment
>>
>> “which will guide you through connecting and configuring your
>> printer's wireless connection. “ → “which will guide you through
>> connecting and configuring your printer 's wireless connection .”
>>
>> If it is the same script, I wonder why it generates two different outputs.
>>
>> I have no experience in Perl, so I could not find the line of code that
>> behaves differently depending on whether the segment is in a file or
>> passed as a parameter to the script.
>>
>> Please help
>>
>> *From:*Ihab Ramadan [mailto:[email protected]]
>> *Sent:* Monday, January 5, 2015 10:09 AM
>> *To:* [email protected]
>> *Subject:* Tokenization problem
>>
>>  
>>
>> Dears,
>>
>> Using the tokenizer on the training files replaces apostrophes
>> with “&apos; s” (with a space), but if I use the same script to tokenize
>> a sentence it turns apostrophes into “&apos;s” (without a space).
>>
>> This problem confuses the decoder during translation.
>>
>> How can I solve this problem?
>>
>> Thanks  
>>
>>  
>>
>> Best Regards
>>
>> Ihab Ramadan | Senior Developer | Saudisoft <http://www.saudisoft.com/>
>> - Egypt | Tel +2 02 330 320 37 Ext-0 | Mob +201007570826 |
>> Fax +20233032036
>>
>>  
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
