Call for Participation - VarDial Evaluation Campaign 2023

Within the scope of the tenth VarDial workshop, co-located with EACL 2023, we 
are organizing an evaluation campaign on similar languages, varieties and 
dialects with three shared tasks. To participate and to receive the training 
data, please fill in the registration form on the workshop website:
https://sites.google.com/view/vardial-2023/shared-tasks

We are organizing the following tasks this year (please check the website for 
more information):

1. SID for low-resource language varieties (SID4LR)

This task is Slot and Intent Detection (SID) for low-resource language 
varieties. Slot detection is a span labeling task, intent detection a 
classification task. The test set will contain Swiss German (GSW), South 
Tyrolean (DE-ST), and Neapolitan (NAP). This shared task seeks to answer the 
following question: How can we best do zero-shot transfer to low-resource 
language varieties without standard orthography?
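For illustration, a SID instance pairs a sentence-level intent label with token-level BIO slot tags. The minimal sketch below uses hypothetical labels, not the actual xSID tag set:

```python
# Hypothetical SID instance: one utterance, one sentence-level intent label,
# and per-token BIO slot tags (span labeling). All labels are illustrative.
tokens = ["wake", "me", "up", "at", "seven", "tomorrow"]
intent = "alarm/set_alarm"                          # classification target
slots = ["O", "O", "O", "O", "B-time", "B-date"]    # span-labeling target

def bio_to_spans(tags):
    """Decode BIO tags into (start, end, label) spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes open span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

print(bio_to_spans(slots))  # [(4, 5, 'time'), (5, 6, 'date')]
```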

The training data consists of the xSID-0.4 corpus, containing data from Snips 
and Facebook. The original training data is in English, but we also provide 
automatic translations of the training data into German, Italian and other 
languages (the projected nmt-transfer data from van der Goot et al., 2021). 
Participants are allowed to use other data to train on, as long as it is not 
annotated for SID in the target languages.

Participants are not required to submit systems for both subtasks; it is also 
possible to participate in only one of the two, intent detection 
(classification) or slot detection (span labeling). Systems will be evaluated 
with the span F1 score for slots and accuracy for intents as the main 
evaluation metrics, as is standard for these tasks. Participants may also 
submit systems for a subset of the three target languages.
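As a rough illustration of these metrics, the sketch below computes exact-match span F1 and intent accuracy; the span boundaries and label names are hypothetical, not drawn from the shared task data:

```python
# Sketch of the standard SID evaluation: exact-match F1 over predicted slot
# spans and plain accuracy over intent labels. Spans are (start, end, label)
# triples; all concrete labels below are illustrative only.

def span_f1(gold_spans, pred_spans):
    """Micro-averaged exact-match F1 over per-sentence sets of slot spans."""
    tp = sum(len(g & p) for g, p in zip(gold_spans, pred_spans))
    n_gold = sum(len(g) for g in gold_spans)
    n_pred = sum(len(p) for p in pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def intent_accuracy(gold, pred):
    """Fraction of sentences whose intent label is predicted exactly."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Tiny worked example (hypothetical spans and intents):
gold = [{(0, 2, "datetime")}, {(1, 3, "location"), (4, 5, "datetime")}]
pred = [{(0, 2, "datetime")}, {(1, 3, "location")}]
print(span_f1(gold, pred))  # ~0.8: 2 true positives, 2 predicted, 3 gold spans
print(intent_accuracy(["alarm/set", "weather/find"],
                      ["alarm/set", "weather/find"]))  # 1.0
```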

2. Discriminating Between Similar Languages - True Labels (DSL-TL)

Discriminating between similar languages (e.g., Croatian and Serbian) and 
language varieties (e.g., Brazilian and European Portuguese) has been a popular 
topic at VarDial since its first edition. The DSL shared tasks organized in 
2014, 2015, 2016, and 2017 have addressed this issue by providing participants 
with the DSL Corpus Collection (DSLCC), a collection of journalistic texts 
written in multiple similar languages and language varieties. 
The DSLCC was compiled under the assumption that each instance's gold label is 
determined by where the text is retrieved from. While this is a straightforward 
(and mostly accurate) practical assumption, previous research has shown the 
limitations of this problem formulation as some texts may present no linguistic 
marker that allows systems or native speakers to discriminate between two very 
similar languages or language varieties.

We tackle this important limitation by introducing the DSL True Labels (DSL-TL) 
task. DSL-TL will provide participants with a human-annotated DSL dataset. A 
subset of nearly 13,000 sentences was retrieved from the DSLCC and annotated 
by multiple native speakers of the included languages and varieties, 
namely English (American and British), Portuguese (Brazilian and European), 
and Spanish (Argentinian and Peninsular). To the best of our knowledge, this 
is the first dataset of its kind, opening exciting new avenues for language 
identification research.

3. Discriminating Between Similar Languages - Speech (DSL-S)

In the DSL-S 2023 shared task, participants use training and development sets 
from the Mozilla Common Voice (CV) dataset to develop a language identifier 
for speech. The nine languages selected for the task come from four different 
subgroups of the Indo-European and Uralic language families. The test data 
used in this task is the Common Voice test data for the nine languages. 
Participants are asked not to evaluate their systems on the test data 
themselves, nor to investigate the test data in any other way, before the 
shared task results have been published. The total amount of unpacked speech 
data is around 15 gigabytes. Only the .mp3 files from the test set may be 
used when generating the results. The metadata concerning the test audio 
files, including their transcriptions, must not be used. This task is audio 
only.

The 9-way classification task is divided into two separate tracks. In the 
closed track, only the training and development data in the Common Voice 
dataset are allowed, and no other data may be used; this prohibition includes 
systems and models trained (supervised or unsupervised) on any other data. In 
the open track, participants may use any openly available datasets and models 
(i.e., available to any prospective shared task participant) that do not 
include, and were not trained on, the Mozilla Common Voice test set.

Dates

Training set release: January 23, 2023
Test set release: February 6, 2023
Submissions due: February 17, 2023
Paper submission deadline: February 27, 2023
Notification of acceptance: March 13, 2023
Camera-ready papers due: March 27, 2023

Of course, VarDial also accepts research papers focusing on computational 
methods and language resources for closely related languages, language 
varieties, and dialects. The full call for papers can be found here:
https://sites.google.com/view/vardial-2023/call-for-papers

Contact: [email protected] or [email protected]
_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]