I can't reproduce the problem. With the original tokenizer.perl script I
get the same result whether I use "echo" on the command line or read
from a file, i.e. no space between the apostrophe and the "s":
tahoar@asus-notebook:~$ echo "which will guide you through connecting and configuring your printer's wireless connection." | tokenizer.perl -q -l en
which will guide you through connecting and configuring your printer 's wireless connection .
tahoar@asus-notebook:~$ tokenizer.perl -q -l en < test.txt
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
which will guide you through connecting and configuring your printer 's wireless connection .
(five copies of your sentence in test.txt)
On 01/14/2015 04:37 PM, Ihab Ramadan wrote:
Dears,
I found the problem.
At line 289 of the tokenizer.perl script, just add a space in the
replacement, like this.
The original code:
$text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 '$2/g;
The modified code:
$text =~ s/([\p{IsAlpha}])[']([\p{IsAlpha}])/$1 ' $2/g;
With this modification, tokenizing a file gives the same output as
tokenizing a single segment.
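The effect of that extra space can be sketched outside the script. The
following is only a rough Python approximation of the substitution, not
the tokenizer itself: Python's re module has no \p{IsAlpha}, so
[^\W\d_] is assumed here as a stand-in for "any letter", and the two
function names are made up for illustration.

```python
import re

# Assumption: [^\W\d_] approximates Perl's \p{IsAlpha} (a Unicode letter).
LETTER = r"[^\W\d_]"
APOSTROPHE_BETWEEN_LETTERS = re.compile(rf"({LETTER})'({LETTER})")


def split_before_apostrophe(text: str) -> str:
    # Space only before the apostrophe: printer's -> printer 's
    return APOSTROPHE_BETWEEN_LETTERS.sub(r"\1 '\2", text)


def split_both_sides(text: str) -> str:
    # Space on both sides of the apostrophe: printer's -> printer ' s
    return APOSTROPHE_BETWEEN_LETTERS.sub(r"\1 ' \2", text)


sentence = "configuring your printer's wireless connection."
print(split_before_apostrophe(sentence))
print(split_both_sides(sentence))
```

Whichever variant is used, the training data and the decoder input need
to be tokenized the same way.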
Thanks
*From:*Ihab Ramadan [mailto:[email protected]]
*Sent:* Wednesday, January 14, 2015 11:14 AM
*To:* [email protected]
*Subject:* RE: Tokenization problem
Dears,
I still have this problem. To avoid confusing the decoder I used the
"-no-escape" parameter of the tokenizer.perl script, but when
tokenizing files an extra space is still added after the apostrophe,
whereas tokenizing a single segment produces no extra space.
For example:
In the file:
“which will guide you through connecting and configuring your
printer's wireless connection.” -> “which will guide you through
connecting and configuring your printer ' s wireless connection .”
As a segment:
“which will guide you through connecting and configuring your
printer's wireless connection.” -> “which will guide you through
connecting and configuring your printer 's wireless connection .”
I wonder why the same script generates two different outputs.
I have no experience in Perl, so I could not find the line of code that
behaves differently depending on whether the segment is in a file or is
passed to the script as a single segment.
Please help.
*From:*Ihab Ramadan [mailto:[email protected]]
*Sent:* Monday, January 5, 2015 10:09 AM
*To:* [email protected] <mailto:[email protected]>
*Subject:* Tokenization problem
Dears,
Using the tokenizer on the training files replaces the apostrophes
with “' s” (with a space), but if I use the same script to tokenize
a single sentence it produces “'s” (without a space).
This problem confuses the decoder during translation.
How can I solve this problem?
Thanks
Best Regards
/Ihab Ramadan/ | Senior Developer | Saudisoft <http://www.saudisoft.com/>
- Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 |
Fax +20233032036
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support