I don't see the problem. I get the same results from the original tokenizer.perl script whether I pipe the sentence in with "echo" or redirect it from a file, i.e. there is no space between the apostrophe and the "s".

tahoar@asus-notebook:~$ echo "which will guide you through connecting and configuring your printer's wireless connection." | tokenizer.perl -q -l en
which will guide you through connecting and configuring your printer 's wireless connection .

tahoar@asus-notebook:~$ tokenizer.perl -q -l en < test.txt
which will guide you through connecting and configuring your printer &apos;s wireless connection .
which will guide you through connecting and configuring your printer &apos;s wireless connection .
which will guide you through connecting and configuring your printer &apos;s wireless connection .
which will guide you through connecting and configuring your printer &apos;s wireless connection .
which will guide you through connecting and configuring your printer &apos;s wireless connection .

(five copies of your sentence in test.txt)



On 01/14/2015 04:37 PM, Ihab Ramadan wrote:

Dears,

I found the problem.

At line 289 of the tokenizer.perl script, just add a space, like this:

The original code

$text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 ' $2/g;

The modified one

$text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 '  $2/g;

With this modification, tokenizing a file gives the same result as tokenizing a single segment.
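To make the effect of that one space concrete, here is a minimal Python sketch of the same kind of substitution. It is an approximation, not the script's code: Python's `re` has no `\p{IsAlpha}`, so `[^\W\d_]` stands in for "a letter", and the two replacement strings are illustrative variants rather than lines copied from any particular version of tokenizer.perl.

```python
import re

# Assumption: [^\W\d_] approximates Perl's \p{IsAlpha} (a Unicode letter).
LETTER = r"[^\W\d_]"
CONTRACTION = re.compile(rf"({LETTER})'({LETTER})")


def split_contraction(text: str, replacement: str) -> str:
    """Rewrite letter-apostrophe-letter sequences using the given replacement."""
    return CONTRACTION.sub(replacement, text)


sentence = "your printer's wireless connection"

# No space after the apostrophe: the clitic stays attached to it.
attached = split_contraction(sentence, r"\1 '\2")
# A space after the apostrophe: the apostrophe becomes its own token.
detached = split_contraction(sentence, r"\1 ' \2")

print(attached)  # your printer 's wireless connection
print(detached)  # your printer ' s wireless connection
```

Whichever form a given copy of the script uses, the training data and the text sent to the decoder need to be tokenized with the same one.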

Thanks

*From:*Ihab Ramadan [mailto:[email protected]]
*Sent:* Wednesday, January 14, 2015 11:14 AM
*To:* [email protected]
*Subject:* RE: Tokenization problem

Dears,

I still have this problem. To avoid confusing the decoder I used the “-no-escape” parameter of the tokenizer.perl script, but when tokenizing files an extra space is still added after the apostrophe, whereas tokenizing a single segment produces no extra space.

For example

In the file

“which will guide you through connecting and configuring your printer's wireless connection.” -> “which will guide you through connecting and configuring your printer ' s wireless connection .”

As a segment

“which will guide you through connecting and configuring your printer's wireless connection.” -> “which will guide you through connecting and configuring your printer 's wireless connection .”
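A quick way to pin down where the two outputs diverge is to compare them token by token. The sketch below uses a hypothetical helper, `first_token_mismatch`, which is not part of Moses, and applies it to the two outputs quoted above.

```python
from typing import Optional, Tuple


def first_token_mismatch(a: str, b: str) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, token_a, token_b) for the first differing token, or None."""
    tokens_a, tokens_b = a.split(), b.split()
    for i, (x, y) in enumerate(zip(tokens_a, tokens_b)):
        if x != y:
            return i, x, y
    if len(tokens_a) != len(tokens_b):
        # One output is a strict prefix of the other.
        i = min(len(tokens_a), len(tokens_b))
        extra_a = tokens_a[i] if i < len(tokens_a) else None
        extra_b = tokens_b[i] if i < len(tokens_b) else None
        return i, extra_a, extra_b
    return None


file_out = ("which will guide you through connecting and configuring "
            "your printer ' s wireless connection .")
segment_out = ("which will guide you through connecting and configuring "
               "your printer 's wireless connection .")

print(first_token_mismatch(file_out, segment_out))  # (10, "'", "'s")
```

The mismatch sits at the apostrophe token, so the difference comes from how the contraction is split and not from escaping.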

I wonder why the same script generates two different outputs.

I have no experience in Perl, so I could not find the line of code that behaves differently depending on whether the segment comes from a file or is passed to the script as a single segment.

Please help

*From:*Ihab Ramadan [mailto:[email protected]]
*Sent:* Monday, January 5, 2015 10:09 AM
*To:* [email protected] <mailto:[email protected]>
*Subject:* Tokenization problem

Dears,

Using the tokenizer on the training files replaces apostrophes with “&apos; s” (with a space), but if I use the same script to tokenize a single sentence it produces “&apos;s” (without a space).

This problem confuses the decoder during translation.

How can I solve this problem?

Thanks

Best Regards

Ihab Ramadan | Senior Developer | Saudisoft <http://www.saudisoft.com/> - Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax +20233032036



_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
