[Mt-list] PMIndia - A Collection of Parallel Corpora of Languages of India

Barry Haddow Wed, 29 Jan 2020 03:56:10 -0800

Hi All

We have released a new collection of sentence-aligned corpora pairing English with 13 different languages spoken in India. Up to 56k sentence pairs are available for each language pair. The languages of India contained in the corpora are Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Manipuri, Marathi, Odia, Punjabi, Tamil, Telugu and Urdu. We also provide a larger version of the corpus, which is document-aligned only.


The corpus is available here: http://data.statmt.org/pmindia/

There is an accompanying paper which describes the construction of the corpus, a comparison of alignment methods, and some initial MT results.


https://arxiv.org/abs/2001.09907


Barry Haddow and Faheem Kirefu




--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

