Call for Participation - VarDial Evaluation Campaign 2020

Within the scope of the seventh VarDial workshop, co-located with COLING 2020, 
we are organizing an evaluation campaign on similar languages, varieties and 
dialects with three shared tasks.

URL: https://sites.google.com/view/vardial2020/evaluation-campaign

To participate and to receive the training data, please fill in the registration
form available on the workshop website.

The tasks we are organizing this year are the following (please check the 
website for more information):

- RDI - Romanian Dialect Identification: In the Romanian Dialect Identification 
(RDI) shared task we provide participants with the MOROCO data set for 
training, which contains Moldavian (MD) and Romanian (RO) samples of text 
collected from the news domain. The task is a binary classification by dialect, 
in which a classification model is required to discriminate between the 
Moldavian (MD) and the Romanian (RO) dialects. The task is closed; therefore,
participants are not allowed to use external data to train their models. The 
test set will contain newly collected text samples, not previously included in 
MOROCO. The test samples will come from a different domain, hence the methods 
have to take the cross-domain nature of the task into account.

- SMG - Social Media Variety Geolocation: Most existing VarDial tasks are 
language identification tasks: they are framed as classification tasks in which 
each instance is associated with a language variety label. For many language 
areas, defining a set of discrete labels is not trivial, as there is a 
continuum between varieties rather than clear-cut borders. Therefore, we 
introduce a geolocation task this year: given a text, the participants have to 
predict its geographic location in terms of latitude/longitude coordinates. 
Geolocation can be framed as a double regression task, but more sophisticated 
model architectures have been proposed (e.g., Rahimi et al. 2017a, 2017b). 
Using data from the social media platforms Twitter and Jodel, we provide three 
subtasks for three language areas: 1. Standard German Jodels; 2. Swiss German 
Jodels; 3. BCMS Tweets. All three subtasks will use the same data format and 
evaluation methodology, and participants are encouraged to submit their systems 
for all subtasks.

- ULI - Uralic Language Identification: This task focuses on discriminating 
between the languages in the Uralic group as defined by the ISO 639-3 standard. 
The task includes 29 individual relevant languages, some of which are extremely 
closely related and similar, such as Kven Finnish (fkv) and Tornedalen Finnish 
(fit). These languages are spoken in an area stretching from Scandinavia,
Estonia, and Finland all the way to Russian Siberia. Many of the languages used
within Russia are
written using modified Cyrillic alphabets. Most of the included languages can 
be defined as under-resourced, for example, Karelian (krl) and Livvi-Karelian 
(olo), which have fewer than 40,000 native speakers combined. Even more
challenging examples are Nganasan, with an estimated 125 speakers and a very
limited online presence, and Kemi Sami, which is extinct and only scarcely
documented. We acknowledge that the ISO 639-3 classification we have used may
not be without problems, but for the purposes of this shared task it identifies
these 29 language varieties adequately. Three tracks are available in this
shared task. More information and data are available on the website.
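
To make the "double regression" framing of the SMG task concrete, here is a
minimal sketch of a baseline that predicts two continuous values (latitude and
longitude) for a text. Everything in it is an assumption made for illustration:
the function names, the character-trigram nearest-neighbour approach, the toy
training rows, and the use of great-circle distance as an error measure. It is
not the official data format, baseline, or evaluation metric; please check the
website for those.

```python
import math


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def predict_coordinates(text, train, k=3):
    """Predict (lat, lon) as the mean of the k most similar training texts.

    `train` is a list of (text, lat, lon) tuples (a hypothetical layout).
    Latitude and longitude are averaged independently, which is only
    reasonable for small regions that do not cross the antimeridian.
    """
    query = char_ngrams(text)
    ranked = sorted(
        train,
        key=lambda row: jaccard(query, char_ngrams(row[0])),
        reverse=True,
    )
    nearest = ranked[:k]
    lat = sum(row[1] for row in nearest) / len(nearest)
    lon = sum(row[2] for row in nearest) / len(nearest)
    return lat, lon


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points (haversine formula)."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Made-up toy rows for illustration only; not taken from the shared task data.
toy_train = [
    ("grüezi mitenand", 47.37, 8.54),
    ("moin moin", 53.55, 9.99),
]

if __name__ == "__main__":
    lat, lon = predict_coordinates("grüezi", toy_train, k=1)
    print(lat, lon)
    print(haversine_km(lat, lon, 53.55, 9.99))
```

A nearest-neighbour average is used here only because it needs no external
libraries; any model that outputs a pair of coordinates fits the same framing.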

Best,
Marcos