Re: [Moses-support] Tokenization problem

Tom Hoar Wed, 14 Jan 2015 03:51:48 -0800

I don't see the problem. I get the same results with the original tokenizer.perl script whether I use "echo" on the command line or pipe from a file, i.e. no space between the apostrophe and "s".

tahoar@asus-notebook:~$ echo "which will guide you through connecting and configuring your printer's wireless connection." | tokenizer.perl -q -l en
which will guide you through connecting and configuring your printer's wireless connection .


tahoar@asus-notebook:~$ tokenizer.perl -q -l en < test.txt
which will guide you through connecting and configuring your printer's wireless connection .
which will guide you through connecting and configuring your printer's wireless connection .
which will guide you through connecting and configuring your printer's wireless connection .
which will guide you through connecting and configuring your printer's wireless connection .
which will guide you through connecting and configuring your printer's wireless connection .


(five copies of your sentence in test.txt)
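
The comparison above (one sentence through "echo" versus the same sentences read from a file) can be automated. What follows is only a sketch: same_tokenization is a hypothetical helper name, not part of Moses, and it assumes the tokenizer reads stdin and writes stdout, as in the commands above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Moses): succeeds when a file tokenizes to
# the same text whether it is fed to the command whole or one line at a time.
# Usage: same_tokenization FILE COMMAND [ARGS...]
same_tokenization() {
    file=$1
    shift

    # Whole file on stdin, as in: tokenizer.perl -q -l en < test.txt
    whole=$("$@" < "$file") || return 2

    # One line at a time, as in: echo "..." | tokenizer.perl -q -l en
    single=$(
        while IFS= read -r line || [ -n "$line" ]; do
            printf '%s\n' "$line" | "$@"
        done < "$file"
    )

    [ "$whole" = "$single" ]
}
```

For example, "same_tokenization test.txt tokenizer.perl -q -l en && echo identical" prints "identical" only when both ways of feeding the input give the same output. If they differ, the input files themselves are worth inspecting before changing the script.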



On 01/14/2015 04:37 PM, Ihab Ramadan wrote:

Dears,

I found the problem.
At line 289 of the tokenizer.perl script, just add a space, like this.
The original code:

$text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 ' $2/g;

The modified one

$text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 '  $2/g;
With this modification, tokenizing a file gives the same result as tokenizing a single segment.
Thanks
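
The two outputs discussed in this thread, "printer 's" and "printer ' s", differ only in whether the replacement string puts a space after the apostrophe. The sketch below shows that effect with sed rather than Perl. It is an approximation and not the Moses tokenizer: [[:alpha:]] stands in for \p{IsAlpha}, and the function names are made up for illustration.

```shell
#!/bin/sh
# Rough sed stand-ins for the two replacement strings; NOT the Moses tokenizer.
# [[:alpha:]] approximates Perl's \p{IsAlpha} (locale-dependent).

# Space before the apostrophe only: printer's -> printer 's
split_attached() {
    sed -E "s/([[:alpha:]])'([[:alpha:]])/\1 '\2/g"
}

# Space on both sides of the apostrophe: printer's -> printer ' s
split_separated() {
    sed -E "s/([[:alpha:]])'([[:alpha:]])/\1 ' \2/g"
}

echo "your printer's wireless connection" | split_attached
echo "your printer's wireless connection" | split_separated
```

Whichever form is chosen, the training data and the input sent to the decoder need to be tokenized the same way.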

*From:*Ihab Ramadan [mailto:[email protected]]
*Sent:* Wednesday, January 14, 2015 11:14 AM
*To:* [email protected]
*Subject:* RE: Tokenization problem

Dears,
I still have this problem. To avoid confusing the decoder, I used the “-no-escape” parameter of the tokenizer.perl script, but tokenizing files still adds an extra space after the apostrophe, whereas tokenizing a single segment does not.
For example

In the file
“which will guide you through connecting and configuring your printer's wireless connection.” -> “which will guide you through connecting and configuring your printer ' s wireless connection .”
As a segment
“which will guide you through connecting and configuring your printer's wireless connection.” -> “which will guide you through connecting and configuring your printer 's wireless connection .”
I wonder why the same script generates two different outputs.
I have no experience with Perl, so I could not find the line of code that behaves differently depending on whether the segment is in a file or passed to the script as a single segment.
Please help.

*From:*Ihab Ramadan [mailto:[email protected]]
*Sent:* Monday, January 5, 2015 10:09 AM
*To:* [email protected] <mailto:[email protected]>
*Subject:* Tokenization problem

Dears,
Using the tokenizer on the training files replaces apostrophes with “' s” (with a space), but if I use the same script to tokenize a single sentence it produces “'s” (without a space).
This problem confuses the decoder during translation.

How can I solve this problem?

Thanks

Best Regards
/Ihab Ramadan/ | Senior Developer | Saudisoft <http://www.saudisoft.com/> - Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax +20233032036
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

