Hi all,
We have released a new sentence-aligned corpus pairing English with 13
languages spoken in India. Up to 56k sentence pairs are available for
each language pair. The Indian languages included in the corpus
are Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Manipuri,
Marathi, Odia, Punjabi, Tamil, Telugu and Urdu. We also provide a larger
version of the corpus, document-aligned only.
The corpus is available here: http://data.statmt.org/pmindia/
There is an accompanying paper which describes the construction of the
corpus, a comparison of alignment methods, and some initial MT results.
https://arxiv.org/abs/2001.09907
Barry Haddow and Faheem Kirefu
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.