*Shared Task: Parallel Corpus Filtering*
at the Third Conference on Machine Translation (WMT18)

This new shared task tackles the problem of cleaning noisy parallel
corpora. Given a noisy parallel corpus (crawled from the web), participants
develop methods to filter it to a smaller set of high-quality sentence
pairs.

We provide a very noisy 1 billion word (English token count) German-English
corpus crawled from the web as part of the Paracrawl project. We ask
participants to subselect sentence pairs that amount to (a) 100 million
words, and (b) 10 million words. The quality of the resulting subsets is
determined by the quality of a statistical machine translation system
(Moses, phrase-based) and a neural machine translation system (Marian)
trained on this data. The quality of each machine translation system is
measured by BLEU score on (a) the official WMT 2018 news translation test
set and (b)
another undisclosed test set.
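
As a rough illustration of the subselection step, the sketch below keeps the
highest-scoring sentence pairs until an English token budget is reached. It
is not the official tooling: the function name subselect, the per-pair
scores, whitespace tokenization, and the rule of stopping at the first pair
that would exceed the budget are all assumptions made for this example.

```python
def subselect(pairs, scores, max_english_tokens):
    """Pick the best-scoring sentence pairs within an English token budget.

    pairs  -- list of (german, english) sentence strings
    scores -- one quality score per pair, higher is better (hypothetical;
              producing these scores is the participant's filtering method)
    max_english_tokens -- budget, e.g. 100_000_000 or 10_000_000

    Returns (indices of kept pairs in original order, English tokens used).
    """
    if len(pairs) != len(scores):
        raise ValueError("need exactly one score per sentence pair")

    # Visit pairs from highest to lowest score.
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)

    selected = []
    total = 0
    for i in order:
        _german, english = pairs[i]
        # Assumption: tokens are counted by splitting on whitespace.
        n_tokens = len(english.split())
        if total + n_tokens > max_english_tokens:
            # Assumption: stop at the first pair that would exceed the budget.
            break
        selected.append(i)
        total += n_tokens

    return sorted(selected), total
```

Calling subselect(pairs, scores, 100_000_000) and
subselect(pairs, scores, 10_000_000) on the same scored corpus would give
the two subsets requested above.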

Release of raw parallel data: April 1, 2018
Submission deadline for subsampled sets: June 22, 2018
System descriptions due: July 6, 2018
Announcement of results: July 9, 2018
Camera-ready for system descriptions: July 27, 2018

Philipp Koehn (Johns Hopkins University / University of Edinburgh)
Huda Khayrallah (Johns Hopkins University)
Kenneth Heafield (University of Edinburgh)
Mikel Forcada (University of Alicante)

This shared task is partially supported by a Google Faculty Research Award
and the Connecting Europe Facility via the Paracrawl project.