[Mt-list] Official release of the United Nations Parallel Corpus v1.0

Marcin Junczys-Dowmunt Fri, 27 May 2016 06:19:34 -0700

Dear all,

I would like to announce the official release of the United Nations Parallel Corpus v1.0. The corpus was created as part of the United Nations commitment to multilingualism and as a reaction to the growing importance of statistical machine translation (SMT) within the Department for General Assembly and Conference Management (DGACM) translation services and the United Nations. It covers 25 years, from 1990 to 2014, and contains documents in the six official languages of the United Nations: Arabic, Chinese, English, French, Russian, and Spanish.

The purpose of the corpus is to allow access to multilingual language resources and facilitate research and progress in various natural language processing tasks, including machine translation. For convenience, the corpus is also available pre-packaged as bi-texts for each language pair.

A subset of the corpus is available as a six-language fully-parallel corpus, i.e. all sentences have equivalents in all six languages. Data from 2015 has been used to create official development sets and test sets, also fully aligned across the six official UN languages. The accompanying paper reports SMT baselines for all language pairs for this corpus.


The corpus is available at:

http://conferences.unite.un.org/UNCorpus

The corresponding publication is available at:

http://www.lrec-conf.org/proceedings/lrec2016/pdf/1195_Paper.pdf

While registering, please leave a short description of the work for which you plan to use the corpus. In the near future we plan to set up a section with references to papers that describe research done with the UN corpus. Feel free to share links and bibliography items with us (either with me or any of the authors of the above paper).


Sorry for cross-posting,
Marcin Junczys-Dowmunt
