Hi Hieu,

Should I apply tokenization and truecasing to both the corpus file and the 
parallel files, or to the parallel files only?

Thanks 

 

From: [email protected] [mailto:[email protected]] On Behalf Of Hieu Hoang
Sent: Monday, November 3, 2014 8:18 PM
To: [email protected]
Cc: moses-support
Subject: Re: [Moses-support] Tokenization issue

 

Hi Ihab,

At its most basic, tokenization separates punctuation from words. However, it 
can also be used to split a word into its morphemes to make it easier to 
process.
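
To illustrate the basic case, here is a minimal Python sketch. It is not the 
Moses tokenizer, and the function name simple_tokenize is made up for this 
example. It only shows that punctuation ends up as separate tokens and is not 
removed, so the variants of "translated text" in the original question still 
produce different token sequences.

```python
import re

def simple_tokenize(text):
    """Separate punctuation from words (illustrative only, not Moses)."""
    # Surround every character that is neither a word character nor
    # whitespace with spaces, so it becomes a token of its own.
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    # split() with no argument also collapses the extra whitespace.
    return spaced.split()

print(simple_tokenize("translated.text"))    # ['translated', '.', 'text']
print(simple_tokenize('"translated" text'))  # ['"', 'translated', '"', 'text']
print(simple_tokenize("translated text"))    # ['translated', 'text']
```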

Moses doesn't include a very good Arabic tokeniser. Each language needs a 
nonbreaking_prefix file, located in 
   scripts/share/nonbreaking_prefixes

This file doesn't exist for Arabic, so the tokenizer uses the English file 
instead.
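
As a rough sketch of what such a prefix list is for (again in Python, not the 
actual Moses code): a word-final period is normally split off, unless the word 
before it is in the list. The set NONBREAKING_PREFIXES and the function 
split_final_periods below are hypothetical names with hand-picked English 
entries. With only an English list, Arabic abbreviations would never match.

```python
# Hypothetical, hand-picked entries for illustration; the real lists are
# the per-language files under scripts/share/nonbreaking_prefixes.
NONBREAKING_PREFIXES = {"Mr", "Dr", "Prof"}

def split_final_periods(text, prefixes=NONBREAKING_PREFIXES):
    """Split a trailing period off each word unless the word is a known prefix."""
    tokens = []
    for word in text.split():
        stem = word[:-1]
        if word.endswith(".") and stem and stem not in prefixes:
            # Ordinary word: the period becomes its own token.
            tokens.extend([stem, "."])
        else:
            # Known abbreviation (or no trailing period): keep as is.
            tokens.append(word)
    return tokens

print(split_final_periods("Dr. Smith arrived."))
# ['Dr.', 'Smith', 'arrived', '.']
```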

If you create a nonbreaking_prefixes file for Arabic, please share it with us. 
Or use a tool like MADA to tokenize your Arabic data.

 

On 28 October 2014 14:40, Ihab Ramadan <[email protected]> wrote:

Dears,

I have a misunderstanding about what tokenization really does.

What I think is that it makes the translation of text like
   translated text
give the same output as
   “translated” text
   translated.text
   translated text .
that is, that it ignores any punctuation in the translated text.

Am I right?

I ran the tokenization on my data, but this is not happening.

Note: the tokenizer script has to be given the language, and it could not 
recognize Arabic (ar), which is my target language.

 

Best Regards

Ihab Ramadan | Senior Developer | Saudisoft - Egypt <http://www.saudisoft.com/>
Tel +2 02 330 320 37 Ext 0 | Mob +201007570826 | Fax +20233032036

 


_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support




-- 

Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
