[Apologies for multiple posting]

Dear all,
The period for submitting resources for the shared task on "Multilingual Low-Resource Translation for Indo-European Languages" ended yesterday. If you are interested in participating and training your multilingual MT system, please visit the webpage:

http://www.statmt.org/wmt21/multilingualHeritage-translation-task.html

Best,
Cristina

November 10-11, 2021
Punta Cana, Dominican Republic (ONLINE)

Shared Task: Multilingual Low-Resource Translation for Indo-European Languages

TASK DESCRIPTION

Massively multilingual machine translation has shown impressive capabilities, including zero-shot and few-shot translation of low-resource languages. However, these models are often evaluated from or into English, where the most data is available, on the assumption that they generalise to other pairs and to low-resource languages. In the first edition of the Multilingual Low-Resource Translation shared task, we focus on multilinguality in the cultural heritage domain for two Indo-European language families: North-Germanic and Romance. We want to explore how information in one language can be transferred to other related languages. To this end, we evaluate translation quality in low-resource language pairs, but explicitly encourage the use of data from the high-resource language pairs in the same family. We would like to answer the question: do we need English and/or Spanish for high-quality translation of related languages? If so, what is the best way to combine the data: pivot, cascade, multilingual, via pre-training?

To explore the topic, we present two tasks with different characteristics, one per language family (see below). Although data in the highest-resourced languages is allowed and encouraged for training, translation quality will only be evaluated on the other pairs. Participants are also encouraged to make available to the community any additional resources useful for the task.

Subtasks

- *Task 1. Europeana thesis abstracts and descriptions*.
  North-Germanic languages: *from/to Icelandic, Norwegian Bokmål and Swedish*. Danish, German and English data is allowed for training, but translation into or from these languages is not evaluated.

- *Task 2. Wikipedia cultural heritage articles*.
  Romance languages: *from Catalan to Occitan, Romanian and Italian*. Spanish, French and Portuguese data (+ English!) is allowed for training, but translation into or from these languages is not evaluated.

EVALUATION

*Task 1. Europeana thesis abstracts and descriptions.* We will provide a test set with a similar number of abstracts in each of the 3 languages, which have to be translated into the other 2. The data will be in XML format, with information on the source language and document boundaries. The validation set has exactly the same structure. All data has been translated by professional translators; the source-language text is always the original.

*Task 2. Wikipedia cultural heritage articles.* We will provide a test set with articles in Catalan that need to be translated into the other 3 languages. As in Task 1, the data will be in XML format, with information on the source language and document boundaries. The validation set has exactly the same structure. All data has been translated by professional translators; the source-language text is always the original.

*Automatic evaluation* results will be reported per language pair and on average. The final ranking will be based on the average translation quality per subtask, that is, per family and not per pair. When possible, we will also perform *human evaluation at document level* on a subset of paragraphs/sentences. If you plan to participate, please contact us so that we can plan the human evaluation.

Since the official evaluation will be done per family, participants may take part in only one of the tasks if they prefer, but they are expected to submit translations for all the languages involved in that task.
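For participants who want to track their own progress per family, the ranking rule described above can be sketched as follows. This is only an illustration under assumptions: it supposes an unweighted arithmetic mean of a single per-pair metric, which the call does not specify, and the names `evaluated_pairs` and `family_score` as well as the use of ISO 639-1 language codes are hypothetical choices, not part of the official evaluation tooling.

```python
from itertools import permutations
from statistics import mean

# Evaluated languages as listed in the call (ISO 639-1 codes are our choice).
TASK1_LANGS = ["is", "nb", "sv"]     # Icelandic, Norwegian Bokmål, Swedish
TASK2_SOURCE = "ca"                  # Catalan
TASK2_TARGETS = ["oc", "ro", "it"]   # Occitan, Romanian, Italian


def evaluated_pairs(task):
    """Return the (source, target) directions that are scored for a subtask."""
    if task == 1:
        # Each of the 3 languages is translated into the other 2: 6 directions.
        return list(permutations(TASK1_LANGS, 2))
    if task == 2:
        # Catalan is the only source language: 3 directions.
        return [(TASK2_SOURCE, target) for target in TASK2_TARGETS]
    raise ValueError(f"unknown task: {task}")


def family_score(task, pair_scores):
    """Average a per-pair metric over all evaluated directions of a subtask.

    Assumption: a plain unweighted mean; the organisers may aggregate differently.
    """
    pairs = evaluated_pairs(task)
    missing = [pair for pair in pairs if pair not in pair_scores]
    if missing:
        # Submissions are expected to cover every direction in the subtask.
        raise ValueError(f"missing scores for: {missing}")
    return mean(pair_scores[pair] for pair in pairs)
```

Calling `family_score(2, scores)` with a dictionary keyed by pairs such as `("ca", "oc")` returns one number for the Romance subtask, and it raises an error if any evaluated direction is missing, mirroring the requirement to submit translations for all languages in a task.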
IMPORTANT DATES

Release of initial training data: April 5, 2021
Additional training data deadline: May 19, 2021
Release of test data: *June 29*, 2021
Results submission deadline: July 6, 2021
Paper submission deadline: August 5, 2021
Paper notification: September 5, 2021
Camera-ready version due: September 15, 2021
Conference at EMNLP: November 10-11, 2021

ORGANISERS

Anastasija Amann (DFKI GmbH, Germany)
Kwabena Amponsah-Kaakyire (DFKI GmbH, Germany)
Cristina España-Bonet (DFKI GmbH, Germany)
Josef van Genabith (DFKI GmbH, Germany)
Leonie Harter (DFKI GmbH, Germany)

Contact

Interested in the task? Join the WMT Google group for questions or comments and/or email cristinae aatt dfki.de.