WMT 2016 Shared Task on Cross-lingual Pronoun Prediction

CALL FOR PARTICIPATION

========================================================
WMT 2016 Shared Task on Cross-lingual Pronoun Prediction
========================================================

Website: http://www.statmt.org/wmt16/pronoun-task.html
At WMT 2016 (co-located with ACL 2016)

We are pleased to announce an exciting cross-lingual pronoun prediction task 
for people interested in (discourse-aware) machine translation, anaphora 
resolution and machine learning in general.

In the cross-lingual pronoun prediction task, participants are asked to predict 
a target-language pronoun given a source-language pronoun in the context of a 
sentence. For example, in the English-to-French sub-task, the goal is to 
predict the correct translation of "it" or "they" into French (ce, elle, elles, 
il, ils, ça, cela, on, OTHER). You may use any type of information that can be 
extracted from the documents. We provide training and development data and a 
simple baseline system using an N-gram language model.

Participants are invited to submit systems for the English-French and 
English-German language pairs, for both directions.

More details can be found below, and on our website: 
http://www.statmt.org/wmt16/pronoun-task.html


Important Dates:

2nd February 2016,  Release of training data
4th April 2016,     Release of test data
11th April 2016,    System submission
8th May 2016,       Paper submission deadline
5th June 2016,      Notification of acceptance
22nd June 2016,     Camera-ready deadline


Mailing list: 
https://groups.google.com/forum/#!forum/wmt-2016-cross-lingual-pronoun-prediction-shared-task

-------------------------------------------------------------------------
Acknowledgements:
The organisation of this task has received support from the following project: 
Discourse-Oriented Statistical Machine Translation funded by the Swedish 
Research Council (2012-916)
-------------------------------------------------------------------------

=========================
Detailed Task Description
=========================

OVERVIEW

Pronoun translation poses a problem for current state-of-the-art SMT systems as 
pronoun systems do not map well across languages, e.g., due to differences in 
gender, number, case, formality, or humanness, and to differences in where 
pronouns may be used. Translation divergences typically lead to mistakes in 
SMT, as when translating the English "it" into French ("il", "elle", or 
"cela"?) or into German ("er", "sie", or "es"?). One way to model pronoun 
translation is to treat it as a cross-lingual pronoun prediction task.

We propose such a task, which asks participants to predict a target-language 
pronoun given a source-language pronoun in the context of a sentence. We 
further provide a lemmatised target-language human-authored translation of the 
source sentence, and automatic word alignments between the source sentence 
words and the target-language lemmata. In the translation, the words aligned to 
a subset of the source-language third-person pronouns are substituted by 
placeholders. The aim of the task is to predict, for each placeholder, the word 
that should replace it from a small, closed set of classes, using any type of 
information that can be extracted from the documents.

The cross-lingual pronoun prediction task will be similar to the task of the 
same name at DiscoMT 2015:

http://www.idiap.ch/workshop/DiscoMT/shared-task

Participants are invited to submit systems for the English-French and 
English-German language pairs, for both directions.


TASK DESCRIPTION

In the cross-lingual pronoun prediction task, you are given a source-language 
document with a lemmatised and POS-tagged human-authored translation and a set 
of word alignments between the two languages. In the translation, the 
lemmatised tokens aligned to the source-language third-person pronouns are 
substituted by placeholders. Your task is to predict, for each placeholder, the 
fully inflected word token that should replace the placeholder from a small, 
closed set of classes. In the English-to-German and English-to-French 
sub-tasks, for example, this means providing the fully inflected German or 
French translation of the English pronoun in the context sketched by the 
lemmatised and POS-tagged target side. You may use any type of information that 
you can extract from the documents.

Lemmatised and POS-tagged target-language data is provided in place of fully 
inflected text. The provision of lemmatised data is intended both to provide a 
challenging task, and to simulate a scenario that is more closely aligned with 
working with machine translation system output. POS tags provide additional 
information which may be useful in the disambiguation of lemmas (e.g. noun vs. 
verb, etc.) and in the detection of patterns of pronoun use.

The pronoun prediction task will be run for the following sub-tasks:

* English-to-German
* German-to-English
* English-to-French
* French-to-English
Details of the source-language pronouns and the prediction classes that exist 
for each of the above sub-tasks are provided in the following section. 
The different combinations of source-language pronoun and target-language 
prediction classes represent some of the different problems that SMT systems 
face when translating pronouns for a given language pair and translation 
direction.

The task will be evaluated automatically by matching the predictions against 
the words found in the reference translation. We will compute the overall 
accuracy, as well as precision, recall and F-score for each class. The primary 
score for the evaluation is the macro-averaged F-score over all classes. 
Compared to accuracy, the macro-averaged F-score favours systems that 
consistently perform well on all classes and penalises systems that maximise 
the performance on frequent classes while sacrificing infrequent ones.
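
To illustrate how these scores relate, here is a minimal Python sketch. It is 
not the official scorer. The function name evaluate is hypothetical, and the 
sketch assumes that the macro average is taken over every class that occurs in 
either the reference or the predictions. The official evaluation may instead 
average over the fixed class list of the sub-task.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, per-class (P, R, F) scores, macro-averaged F-score).

    Assumption: gold and predicted are equal-length lists of class labels,
    one label per placeholder.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was expected but was missed

    scores = {}
    for c in sorted(set(gold) | set(predicted)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        if precision + recall:
            f_score = 2 * precision * recall / (precision + recall)
        else:
            f_score = 0.0
        scores[c] = (precision, recall, f_score)

    accuracy = sum(tp.values()) / len(gold) if gold else 0.0
    macro_f = (sum(f for _, _, f in scores.values()) / len(scores)
               if scores else 0.0)
    return accuracy, scores, macro_f


# A system that always predicts the frequent class "il":
gold = ["il", "il", "il", "elle"]
predicted = ["il", "il", "il", "il"]
accuracy, per_class, macro_f = evaluate(gold, predicted)
```

In this toy example the accuracy is 0.75, but the macro-averaged F-score is 
only about 0.43, because the class "elle" is never predicted correctly.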

The data supplied for the classification task consists of parallel 
source-target text with word alignments. In the target-language text, a subset 
of the words aligned to source-language occurrences of a specified set of 
pronouns have been replaced by placeholders of the form REPLACE_xx, where xx is 
the index of the source-language word the placeholder is aligned to. Your task 
is to predict one of the classes listed in the relevant source-target section 
below, for each occurrence of a placeholder.
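
As a rough illustration of the placeholder convention, the following Python 
sketch locates REPLACE_xx tokens in a target-side sentence and looks up the 
aligned source word. It is a sketch under stated assumptions: the exact file 
format is described on the website and is not reproduced here, the sentences 
are assumed to be whitespace-tokenised, and the index xx is assumed to be 
0-based. The function names and the example sentence pair are hypothetical.

```python
import re

# A placeholder is a whole token of the form REPLACE_xx.
PLACEHOLDER = re.compile(r"REPLACE_(\d+)")


def find_placeholders(target_tokens):
    """Return a list of (target_position, source_index) pairs."""
    found = []
    for position, token in enumerate(target_tokens):
        match = PLACEHOLDER.fullmatch(token)
        if match:
            found.append((position, int(match.group(1))))
    return found


def source_word(source_tokens, source_index):
    """Look up the source word for a placeholder (assumes 0-based indices)."""
    return source_tokens[source_index]


# Hypothetical example pair: English source, lemmatised French target.
source = "it is raining".split()
target = "REPLACE_0 pleuvoir".split()

for position, index in find_placeholders(target):
    print(position, index, source_word(source, index))
```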

The training, development and test datasets have been filtered to remove 
non-subject position pronouns. Additional filtering has also been applied to 
the test set to remove erroneous pronoun examples and thereby ensure the fair 
and accurate evaluation of system performance. For more information on the 
format of the data files and their filtering, please see the website.

The complete test data for the classification task, including reference 
translations and word alignments, will be released on 4th April 2016. Your 
submission is due on 11th April 2016.


SOURCE-LANGUAGE PRONOUN SETS AND TARGET-LANGUAGE PREDICTION CLASS DETAILS

The following sections describe the set of source-language pronouns and 
target-language classes to be predicted, for each of the four sub-tasks. Please 
note that the sub-tasks are asymmetric in terms of the source-language pronouns 
and prediction classes. The selection of the source-language pronouns and their 
target-language prediction classes for each sub-task is based on the variation 
that is possible when translating a given source-language pronoun. For example, 
when translating the English pronoun "it" into French, a decision must be made 
as to the gender of the French pronoun, with "il" and "elle" both providing 
valid options. The translation of the English pronouns "he" and "she" into 
French, however, does not require such a decision. These may simply be mapped 
1-to-1, as "il" and "elle" respectively. The translation of "he" and "she" from 
English into French is therefore not considered an "interesting" problem and as 
such, these pronouns are excluded from the source-language set for the 
English->French sub-task. In the opposite translation, the French pronoun "il" 
may be translated as "it" or "he", and "elle" as "it" or "she". As a decision 
must be taken as to the appropriate target-language translation of "il" and 
"elle", these are included in the set of source-language pronouns for the 
French->English sub-task.

You should *always* predict either a word token or "OTHER". See the prediction 
class lists below for the word tokens to predict for each sub-task.

English-to-French

This sub-task will concentrate on the translation of subject position "it" and 
"they" from English into French. The following prediction classes exist for 
this sub-task:

* ce: The French pronoun ce (sometimes with elided vowel as c') as in the 
expression c'est "it is"
* elle: Feminine singular subject pronoun
* elles: Feminine plural subject pronoun
* il: Masculine singular subject pronoun
* ils: Masculine plural subject pronoun
* cela: Demonstrative pronouns. Includes "cela", "ça", the misspelling "ca", 
and the rare elided form "ç' "
* on: Indefinite pronoun
* OTHER: Some other word, or nothing at all, should be inserted

French-to-English

This sub-task will concentrate on the translation of subject position "elle", 
"elles", "il", and "ils" from French into English. The following prediction 
classes exist for this sub-task:

* he: Masculine singular subject pronoun
* she: Feminine singular subject pronoun
* it: Non-gendered singular subject pronoun
* they: Non-gendered plural subject pronoun
* this: Demonstrative pronouns (singular). Includes both "this" and "that"
* these: Demonstrative pronouns (plural). Includes both "these" and "those"
* there: Existential "there"
* OTHER: Some other word, or nothing at all, should be inserted

English-to-German

This sub-task will concentrate on the translation of subject position "it" and 
"they" from English into German. The following prediction classes exist for 
this sub-task:

* er: Masculine singular subject pronoun
* sie: Feminine singular subject pronoun
* es: Neuter singular subject pronoun
* man: Indefinite pronoun
* OTHER: Some other word, or nothing at all, should be inserted

German-to-English

This sub-task will concentrate on the translation of subject position "er", 
"sie" and "es" from German into English. The following prediction classes exist 
for this sub-task:

* he: Masculine singular subject pronoun
* she: Feminine singular subject pronoun
* it: Non-gendered singular subject pronoun
* they: Non-gendered plural subject pronoun
* you: Second person pronoun (with both generic and deictic uses)
* this: Demonstrative pronouns (singular). Includes both "this" and "that"
* these: Demonstrative pronouns (plural). Includes both "these" and "those"
* there: Existential "there"
* OTHER: Some other word, or nothing at all, should be inserted
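
The class lists above can be written down as a small lookup table. The Python 
sketch below does this and maps a word token to its prediction class, folding 
the variants mentioned above (for example "ça" into "cela", or "those" into 
"these") and falling back to "OTHER". The sub-task keys, the names CLASSES, 
VARIANTS and to_class, and the lowercasing step are assumptions made for 
illustration. They are not part of the task definition.

```python
# Prediction classes per sub-task, as listed in this description.
CLASSES = {
    "en-fr": ["ce", "elle", "elles", "il", "ils", "cela", "on", "OTHER"],
    "fr-en": ["he", "she", "it", "they", "this", "these", "there", "OTHER"],
    "en-de": ["er", "sie", "es", "man", "OTHER"],
    "de-en": ["he", "she", "it", "they", "you", "this", "these", "there",
              "OTHER"],
}

# Surface variants that the description folds into another class.
VARIANTS = {
    "en-fr": {"c'": "ce", "ça": "cela", "ca": "cela", "ç'": "cela"},
    "fr-en": {"that": "this", "those": "these"},
    "en-de": {},
    "de-en": {"that": "this", "those": "these"},
}


def to_class(subtask, token):
    """Map a word token to a prediction class of the given sub-task."""
    word = token.lower()  # assumption: classes are case-insensitive
    word = VARIANTS[subtask].get(word, word)
    # Anything outside the closed set, including an empty token, is OTHER.
    return word if word in CLASSES[subtask] else "OTHER"


print(to_class("en-fr", "Ça"))
print(to_class("de-en", "those"))
print(to_class("en-de", "wir"))
```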

