Opening of the Faetar Low-Resource ASR Challenge 2025
We are pleased to officially announce the opening of the Faetar Low-Resource
ASR Challenge 2025. While we were not able to secure a special session
dedicated to the challenge at Interspeech 2025, we strongly encourage
participants to submit papers describing their systems to the conference.
Accordingly, we plan to adhere to a timeline that allows us to return test
results and announce winners in time for participants to prepare Interspeech
papers (see below).
Challenge website: https://perceptimatic.github.io/faetarspeech/
The Faetar Low-Resource ASR Challenge aims to focus researchers’ attention on
several issues which are common to many archival collections of speech data:
- noisy field recordings
- lack of a standard orthography, leading to noise in the transcriptions in
the form of transcriber inconsistencies
- only a few hours of transcribed data
- a larger collection of untranscribed data
- no easily available additional data in the language (textual or speech)
- “dirty” transcriptions in documents, which contain extraneous material that
needs to be filtered out
By focusing multiple research groups on a single corpus of this kind, we aim to
gain deeper insights into these problems than can be achieved otherwise.
The challenge uses the Faetar ASR Benchmark Corpus. Faetar (pronounced
[fajdar]) is a variety of the Franco-Provençal language which developed in
isolation in Italy, far from other speakers of Franco-Provençal, and in close
contact with Italian. Faetar has fewer than 1,000 speakers around the world, in
Italy and in the diaspora. It is endangered, and preservation, learning, and
documentation are a priority for many community members. The benchmark data
represents the majority of all archived speech recordings of Faetar in
existence, and it is not available from any other source.
We propose four tracks:
- Constrained ASR. Participants should focus on the challenge of improving ASR
architectures to work with small, poor-quality data sets. Participants may not
use any resources to train or fine-tune their models beyond the files
contained in the provided train set. No external pre-trained acoustic models
or language models are allowed, nor is the use of the unlabelled portion of
the Faetar challenge data set.
The remaining three “thematic” tracks may be explored in any combination and
should not be considered mutually exclusive:
- Using pre-trained acoustic models or language models. Participants focus on
the most effective way to make use of models pre-trained on other languages.
- Using unlabelled data. The challenge data also includes ~20 hrs of unlabelled
data. Participants focus on finding the most effective way to make use of it.
- Dirty data. The training data was extracted and automatically aligned from
long-form audio and partial transcriptions in “cluttered” word processor
files, relying on error-prone voice activity detection (VAD), scraping, and
alignment. Participants focus on improving the pipeline for extracting useful
training data, with the ultimate goal of improving ASR performance.
Submissions will be evaluated on phone error rate (PER) on the test set.
Participants are provided with a dev kit that allows them to calculate PER on
the dev and train sets, as well as to reproduce the baselines.
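For concreteness, PER is the standard Levenshtein-distance-based error rate,
computed over phone sequences rather than words: the number of substitutions,
deletions, and insertions needed to turn the hypothesis into the reference,
divided by the reference length. The Python sketch below is a minimal
illustration only (the function name and example sequences are our own; the
dev kit's scorer is authoritative on details such as normalization):

    # Minimal PER sketch; the challenge dev kit's scorer is authoritative.
    def phone_error_rate(reference, hypothesis):
        """PER = (substitutions + deletions + insertions) / len(reference)."""
        m, n = len(reference), len(hypothesis)
        # Dynamic-programming edit distance over phone symbols.
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i                      # i deletions
        for j in range(n + 1):
            dist[0][j] = j                      # j insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1,         # insertion
                                 dist[i - 1][j - 1] + cost)  # substitution
        return dist[m][n] / m if m else float(n > 0)

    # One substitution in a four-phone reference gives PER = 0.25.
    print(phone_error_rate(list("fajd"), list("fajt")))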
For more information, and to register and obtain the data and the dev kit,
please visit the challenge website:
https://perceptimatic.github.io/faetarspeech/
For any other questions, please contact us by writing to
[email protected].