Opening of the Faetar Low-Resource ASR Challenge 2025

We are pleased to officially announce the opening of the Faetar Low-Resource 
ASR Challenge 2025. While we were not able to secure a special session 
dedicated to the challenge at Interspeech 2025, we strongly encourage 
submission of papers describing your systems to the conference. Accordingly, 
we plan to adhere to a timeline that will allow us to return test results and 
announce winners in time for participants to prepare Interspeech papers (see 
below).

       Challenge website: https://perceptimatic.github.io/faetarspeech/

The Faetar Low-Resource ASR Challenge aims to focus researchers’ attention on 
several issues which are common to many archival collections of speech data:

- noisy field recordings
- lack of a standard orthography, leading to transcriber inconsistencies that 
introduce noise into the transcriptions
- only a few hours of transcribed data
- a larger collection of untranscribed data
- no additional data in the language (textual or speech) that is easily 
available
- “dirty” transcriptions embedded in documents that contain extraneous 
material which must be filtered out

By focusing multiple research groups on a single corpus of this kind, we aim to 
gain deeper insights into these problems than can be achieved otherwise.

The challenge uses the Faetar ASR Benchmark Corpus. Faetar (pronounced 
[fajdar]) is a variety of the Franco-Provençal language which developed in 
isolation in Italy, far from other speakers of Franco-Provençal, and in close 
contact with Italian. Faetar has fewer than 1,000 speakers worldwide, in 
Italy and in the diaspora. It is endangered, and preservation, learning, and 
documentation are a priority for many community members. The benchmark data 
represents the majority of all archived speech recordings of Faetar in 
existence, and it is not available from any other source.

We propose four tracks:

- Constrained ASR. Participants should focus on the challenge of improving ASR 
architectures to work with small, poor-quality data sets. Participants may not 
use any resources to train or fine-tune their models beyond the files 
contained in the provided train set. No external pre-trained acoustic models 
or language models are allowed, nor is the use of the unlabelled portion of 
the Faetar challenge data set.

Three other “thematic tracks” can also be explored; they should not be 
considered mutually exclusive:

- Using pre-trained acoustic models or language models. Participants focus on 
the most effective way to make use of models pre-trained on other languages.
- Using unlabelled data. The challenge data also includes ~20 hrs of unlabelled 
data. Participants focus on finding the most effective way to make use of it.
- Dirty data. The training data was extracted and automatically aligned from 
long-form audio and partial transcriptions in “cluttered” word processor files, 
relying on (error-prone) VAD, scraping, and alignment. Participants focus on 
improving the pipeline for extracting useful training data, with the ultimate 
goal of improving performance.
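
For concreteness, the following is a minimal, illustrative energy-based VAD of 
the kind such an extraction pipeline might build on. The actual pipeline's VAD 
is not specified here, so the frame size, hop, and threshold below are purely 
illustrative assumptions (the input is taken to be a float signal in [-1, 1]):

    import numpy as np

    # Illustrative energy-based VAD; a stand-in for the (unspecified) VAD
    # used in the real extraction pipeline. All parameters are assumptions.
    def simple_vad(signal, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
        """Return (start_sec, end_sec) spans whose RMS energy exceeds a
        threshold relative to full scale."""
        frame = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        spans, start = [], None
        for i in range(0, len(signal) - frame, hop):
            window = signal[i:i + frame]
            rms = np.sqrt(np.mean(window ** 2)) + 1e-10  # avoid log(0)
            voiced = 20 * np.log10(rms) > threshold_db
            t = i / sr
            if voiced and start is None:
                start = t                      # span opens
            elif not voiced and start is not None:
                spans.append((start, t))       # span closes
                start = None
        if start is not None:                  # signal ends mid-span
            spans.append((start, len(signal) / sr))
        return spans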

Submissions will be evaluated on phone error rate (PER) on the test set. 
Participants are provided with a dev kit that allows them to calculate PER on 
the dev and train sets, as well as to reproduce the baselines.
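
For reference, PER is the Levenshtein (edit) distance between the hypothesized 
and reference phone sequences, divided by the total number of reference 
phones. A minimal sketch follows, assuming transcripts are space-separated 
phone strings; the dev kit's own scoring script remains authoritative:

    # Minimal PER sketch; illustrative only, not the official scorer.
    def edit_distance(ref, hyp):
        """Levenshtein distance between two phone sequences."""
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            curr = [i]
            for j, h in enumerate(hyp, 1):
                curr.append(min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution (or match)
                ))
            prev = curr
        return prev[-1]

    def per(references, hypotheses):
        """PER = total edit distance / total reference phone count."""
        errors = total = 0
        for ref_str, hyp_str in zip(references, hypotheses):
            ref, hyp = ref_str.split(), hyp_str.split()
            errors += edit_distance(ref, hyp)
            total += len(ref)
        return errors / total

    print(per(["a b c d"], ["a b d"]))  # 1 deletion / 4 phones = 0.25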

For more information, and to register and obtain the data and the dev kit, 
please visit the challenge website:

    https://perceptimatic.github.io/faetarspeech/

For questions, please contact us by writing to 
faetar.asr.challenge@gmail.com.






