Training Data Released - Second VarDial Evaluation Campaign

Within the scope of the VarDial workshop, co-located with COLING 2018, we are 
organizing an evaluation campaign on similar languages, varieties and dialects 
with multiple shared tasks.


We are organizing five shared tasks this year:

- (ADI) Arabic Dialect Identification: The third edition of the ADI task will 
address the multi-dialectal challenge in spoken Arabic in the broadcast news 
domain. In previous editions, we shared acoustic features and lexical word 
sequences extracted from large-vocabulary continuous speech recognition 
(LVCSR). This year, we will 
add phonetic features, which will enable researchers to use both prosodic and 
phonetic features, which are helpful for distinguishing between different 
dialects.

- (GDI) German Dialect Identification: After a successful first edition of the 
(Swiss) German Dialect Identification task in 2017, we organize a second 
iteration of this task. We provide updated data on the same Swiss German 
dialect areas as last year (Basel, Bern, Lucerne, Zurich), but add a fifth 
"surprise dialect" for which no training data is made available. 

- (MTT) Morphosyntactic Tagging of Tweets: This task focuses on morphosyntactic 
annotation (900+ labels) of non-canonical Twitter varieties of three 
South-Slavic languages -- Slovene, Croatian, and Serbian. Task participants 
will be provided with large manually annotated and raw canonical datasets, as 
well as small manually annotated Twitter datasets. 

- (DFS) Discriminating between Dutch and Flemish in Subtitles: The task focuses 
on determining whether a text is written in the Netherlandic or the Flemish 
variant of the Dutch language. For this task, participants are provided with a 
dataset consisting of almost 100,000 professionally produced subtitles for 
movies, documentaries and television shows. 

- (ILI) Indo-Aryan Language Identification: This task focuses on identifying 
five closely related languages of the Indo-Aryan language family – Hindi, Braj 
Bhasha, Awadhi, Bhojpuri, and Magahi. These languages form part of a continuum 
starting from Western Uttar Pradesh (Hindi and Braj Bhasha) to Eastern Uttar 
Pradesh (Awadhi and Bhojpuri) and the neighbouring Eastern state of Bihar 
(Bhojpuri and Magahi). For this task, participants will be provided with a 
dataset of approximately 15,000 sentences in each language, mainly from the 
domain of literature, published over the web as well as in print.

To participate and receive the training data (released March 12), please fill 
out the registration form available at the workshop website. 

The VarDial workshop will take place in August 2018 in Santa Fe, USA.

Marcos on behalf of the VarDial organizers

Dr. Marcos Zampieri
Research Fellow
Research Group in Computational Linguistics
University of Wolverhampton, UK