Dear Wang,

Here are the links to the Persian-English corpora I know of:

- TEP: Tehran English-Persian parallel corpus, built on subtitles. It is free, and you can find it here: download link <http://opus.lingfil.uu.se/download.php?f=OpenSubtitles2011/xml/en-fa.xml.gz>
- ELRA-W0051: generic domain. To obtain this corpus, take a look at this link: <http://catalog.elra.info/product_info.php?products_id=1111>
- PEN: Parallel English-Persian News corpus, a small corpus built on news stories. It is not publicly available yet, but I am going to release it soon (link to the paper: <http://world-comp.org/p2011/ICA4953.pdf>).

For tokenization, you can use any available tokenizer, such as the Moses tokenizer.

If you have more questions, feel free to ask.

Regards,
Amin

On 04/18/2013 10:45 AM, Wang, JinPeng(AWF) wrote:

Hi, everyone

Have you got any Persian and English parallel text or related corpus links? And how to tokenize the Persian language?

Thanks
Regards

Wang, JinPeng(AWF)
eBay, Inc.
Stubhub
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
