Re: [Moses-support] PMIndia - A Collection of Parallel Corpora of Languages of India

Thoudam Doren Singh Wed, 29 Jan 2020 06:44:11 -0800

Hi Barry

Good job. For some language pairs with fewer than 10k sentence pairs, the
reported BLEU scores are quite appealing.



Best Regards


Doren



On Wednesday, January 29, 2020, Barry Haddow <[email protected]> wrote:

> Hi All
>
> We have released a new sentence-aligned corpus pairing English with 13
> different languages spoken in India. Up to 56k sentence pairs are
> available for each pair. The languages of India contained in the corpus
> are Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Manipuri,
> Marathi, Odia, Punjabi, Tamil, Telugu and Urdu. We also provide a larger
> version of the corpus, document-aligned only.
>
> The corpus is available here: http://data.statmt.org/pmindia/
>
> There is an accompanying paper which describes the construction of the
> corpus, a comparison of alignment methods, and some initial MT results.
>
> https://arxiv.org/abs/2001.09907
>
>
> Barry Haddow and Faheem Kirefu
>
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

