Call for Participation - Third VarDial Evaluation Campaign

Within the scope of the VarDial workshop, co-located with NAACL 2019, we are 
organizing an evaluation campaign on similar languages, varieties and dialects 
with multiple shared tasks.

URL: https://sites.google.com/view/vardial2019/campaign

We are organizing five shared tasks this year (check the website for more 
information):

- German Dialect Identification (GDI): After two successful editions of the 
(Swiss) German Dialect Identification task, we propose to organize a third 
iteration of this task. We will again focus on four Swiss German dialect areas 
(Basel, Bern, Lucerne, Zurich). We provide updated speech transcripts for all 
dialect areas and also release corresponding acoustic data in the form of 
iVectors as well as (predicted) word-level normalisation. In particular, the 
acoustic data may help to overcome transcriber bias; recent iterations of the 
Arabic Dialect Identification (ADI) task have already shown that acoustic 
features substantially improve dialect identification.

- Cross-lingual Morphological Analysis (CMA): We introduce the task of 
cross-lingual morphological analysis. Given a word in an unknown related 
language, for example "navifraghju" ("shipwreck" in Corsican), a human speaker 
of several related languages is able to deduce that it is a singular noun by 
comparing it with similar words, for example: "naufragi" 
(Catalan), "naufragio" (Spanish, Italian), "naufrágio" (Portuguese) and 
"naufrage" (French). In this task we invite participants to create 
computational models which will be able to do the same. There will be two 
language families represented, Romance (fusional morphology) and Turkic  
(agglutinative morphology). In the "Closed" track, participants will be given a 
set of word forms with all valid morphological analyses in six languages and 
asked to predict the valid morphological analyses for a seventh, unseen 
language. In the "Semi-Closed" track, the process will be the same, except 
that participants will be provided with additional raw data by the organisers. 
This will take the form of raw-text Wikipedia dumps, bilingual dictionaries 
from the Apertium project and any treebanks available in the known languages 
from the Universal Dependencies project.

- Discriminating between Mainland and Taiwan variation of Mandarin Chinese 
(DMT): Like English, Mandarin has several varieties across its speech 
communities. This task aims at discriminating between two major varieties of 
Mandarin Chinese: Putonghua (Mainland China) and Guoyu (Taiwan). We provide a 
corpus of approximately 10,000 news-domain sentences for each of the two 
varieties. The main task will be to determine whether a given sentence 
belongs to a news article from Mainland China or from Taiwan. The 
sentences are tokenized and punctuation is removed from the texts. Both the 
traditional and the simplified versions of the same corpus are available.

- Moldavian vs. Romanian Cross-topic Identification (MRC): In the Moldavian vs. 
Romanian Cross-topic Identification shared task we provide participants with 
the MOROCO data set which contains Moldavian and Romanian samples of text 
collected from the news domain. The samples belong to one of the following six 
topics: culture, finance, politics, science, sports, tech. The samples are 
preprocessed in order to eliminate named entities. For each sample, the data 
set provides corresponding dialectal and category labels. Based on these 
labels, we propose three sub-tasks for the 2019 VarDial Evaluation Campaign. 
The first 
sub-task is a binary classification by dialect task, in which a classification 
model is required to discriminate between the Moldavian and the Romanian 
dialects. The second sub-task is a Moldavian to Romanian cross-dialect 
multi-class classification by topic task, in which a model is required to 
classify the samples written in the Romanian dialect into six topics, using 
samples written in the Moldavian dialect for training. Finally, the third 
sub-task is a Romanian to Moldavian cross-dialect multi-class classification by 
topic task, in which a model is required to classify the samples written in the 
Moldavian dialect into six topics, using samples written in the Romanian 
dialect for training.

- Cuneiform Language Identification (CLI): This task focuses on discriminating 
between languages and dialects originally written using cuneiform signs. 
The task includes two languages: Sumerian and Akkadian. Furthermore, 
the Akkadian language is divided into six dialects: Old Babylonian, Middle 
Babylonian peripheral, Standard Babylonian, Neo Babylonian, Late Babylonian, 
and Neo Assyrian. These languages and dialects were used in ancient Mesopotamia 
and span a time period of 3,000 years. For training and development, we provide 
the participants with varying amounts of text encoded in Unicode cuneiform 
signs for each language or dialect. We are interested in seeing whether the 
task of language identification between dialects using the same logosyllabic 
writing system is different from language identification between languages 
using segmental scripts.
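
To give a rough feel for the intuition behind the CMA task described above, 
here is a minimal Python sketch. It is not an official baseline: it simply 
copies the analysis of the most string-similar known word form. The tag 
strings ("NOUN;SG", "VERB;INF"), the contrast entry "cantar" and the function 
names are illustrative assumptions, not part of the task data or its 
annotation scheme.

```python
from difflib import SequenceMatcher

# Toy lexicon of (word form, language, analysis). The "naufrag-" forms come
# from the CMA description; the tags and the "cantar" entry are assumptions
# added only so that the lookup has something to discriminate against.
KNOWN = [
    ("naufragi", "Catalan", "NOUN;SG"),
    ("naufragio", "Spanish", "NOUN;SG"),
    ("naufrágio", "Portuguese", "NOUN;SG"),
    ("naufrage", "French", "NOUN;SG"),
    ("cantar", "Spanish", "VERB;INF"),
]


def similarity(a, b):
    """Character-level similarity in [0, 1] between two word forms."""
    return SequenceMatcher(None, a, b).ratio()


def predict_analysis(word, lexicon):
    """Return the analysis of the known word form most similar to `word`."""
    best = max(lexicon, key=lambda entry: similarity(word, entry[0]))
    return best[2]


# Corsican "navifraghju" is closest to the "naufrag-" forms, so it inherits
# their analysis rather than the one attached to "cantar".
print(predict_analysis("navifraghju", KNOWN))
```

A real submission would of course need to handle ambiguous forms with several 
valid analyses, which this nearest-neighbour lookup ignores.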

To participate and to receive the training data, please fill in the 
registration form available on the workshop website. The test sets will be 
released on February 5, 2019.

Best,
Marcos

-----
Dr. Marcos Zampieri
Research Group in Computational Linguistics
University of Wolverhampton, UK
http://pers-www.wlv.ac.uk/~u22984/
