[Corpora-List] August 2025 Newsletter - LDC

Penn LDC via Corpora Fri, 15 Aug 2025 08:20:42 -0700

In this newsletter:
LDC at Interspeech 2025
Fall 2025 LDC data scholarship program


New publications:
Mixer 6 - CHiME 8 Transcribed Calls and 
Interviews<https://catalog.ldc.upenn.edu/LDC2025S07>
Abstract Meaning Representation 2.0 - Machine 
Translations<https://catalog.ldc.upenn.edu/LDC2025T10>
KAIROS Phase 1 Quizlet<https://catalog.ldc.upenn.edu/LDC2025T11>

________________________________
LDC at Interspeech 2025
LDC will be exhibiting at Interspeech 2025<https://www.interspeech2025.org/>, 
held this year August 17-21 in Rotterdam, the Netherlands. Stop by our booth to 
say hello and learn about the latest developments at the Consortium. Also be on 
the lookout for the following presentations, posters, and special sessions 
featuring LDC work:

Comparative Evaluation of Acoustic Feature Extraction Tools for Clinical Speech 
Analysis
Monday, August 18, 11:00-13:00 - Area5-Oral1 - Speech Analysis, Detection and 
Classification 1

Reasoning-Based Approach with Chain-of-Thought for Alzheimer's Detection Using 
Speech and Large Language Models
Tuesday, August 19, 13:30-15:30 - Area1-Poster2B - Databases and Progress in 
Methodology

Special Session: Challenges in Speech Collection, Curation and 
Annotation<https://sites.google.com/view/speech-data-cca-is25/>
Wednesday, August 20, 13:30-15:30 - Area14-SS7 - Part 1
Wednesday, August 20, 16:00-18:00 - Area14-SS8 - Part 2

TELVID: A Multilingual Multi-modal Corpus for Speaker Recognition
Thursday, August 21, 13:30-15:30 - Area4-Oral8 - Speaker Recognition

LDC also supported the Interspeech 2025 URGENT 
Challenge<https://urgent-challenge.github.io/urgent2025/>, which aims to bring 
more attention to constructing Universal, Robust, and Generalizable speech 
EnhancemeNT models.

LDC will post conference updates via our social media platforms. We look 
forward to seeing you in Rotterdam!

Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program are being 
accepted now through September 15, 2025. This program provides eligible 
students with no-cost access to LDC data. Students must complete an application 
consisting of a data use proposal and letter of support from their advisor. For 
application requirements and program rules, visit the LDC Data Scholarships 
page<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________

New publications:
Mixer 6 - CHiME 8 Transcribed Calls and 
Interviews<https://catalog.ldc.upenn.edu/LDC2025S07> was developed for the 7th 
and 8th CHiME (Computational Hearing in Multisource 
Environments)<https://www.chimechallenge.org/> challenges. It contains 80 hours 
of English interviews and telephone speech from Mixer 6 Speech 
(LDC2013S03)<https://catalog.ldc.upenn.edu/LDC2013S03> with transcripts 
developed for the CHiME challenges divided into training, development, and test 
sets. This data was used in CHiME 7 Task 
1<https://www.chimechallenge.org/challenges/chime7/task1/index> and CHiME 8 
Task 1<https://www.chimechallenge.org/challenges/chime8/task1/>, both of which 
focused on transcription and segmentation across varied recording conditions 
such as interviews, meetings, and dinner parties, with an emphasis on 
generalization across recording device types and array topologies.

The data includes audio from Mixer 6 Speech recorded on 13 microphones for a 
total of 1063 hours (corresponding to 80 hours of speech). The development and 
test sets are speaker-disjoint from the training data and consist of fully 
transcribed, multi-microphone interviews. Each transcript segment was labeled 
with the speaker, the uttered text, and the start and end times in seconds for 
that segment.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

Abstract Meaning Representation 2.0 - Machine 
Translations<https://catalog.ldc.upenn.edu/LDC2025T10> was developed at the 
University of Edinburgh, School of Informatics<https://www.ed.ac.uk/informatics>, 
and the University of Zurich<https://www.uzh.ch/en.html>, Department of 
Computational Linguistics<https://www.cl.uzh.ch/en.html>. It consists of Spanish, German, 
Italian, and Mandarin Chinese automatic translations of the source English and 
professionally translated Spanish, German, Italian, and Mandarin Chinese 
sentences in Abstract Meaning Representation 2.0 - Four Translations 
(LDC2020T07)<https://catalog.ldc.upenn.edu/LDC2020T07>. The translations were 
collected through Google Translate between May 2018 and March 2024.

The source English sentences are a subset (1,371 sentences) of the sentences 
contained in Abstract Meaning Representation (AMR) Annotation Release 2.0 
(LDC2017T10)<https://catalog.ldc.upenn.edu/LDC2017T10>, a semantic treebank of 
over 39,000 English natural language sentences from broadcast conversations, 
newswire, and web text.

Translations were from each of the five languages (English, Spanish, German, 
Italian, and Mandarin Chinese) to the other four languages (Spanish, German, 
Italian, and Mandarin Chinese) covering 20 language pairs. The dataset contains 
1,371 source sentences in each language, each with a professionally translated 
source sentence and multiple dated translations by Google Translate.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

KAIROS Phase 1 Quizlet<https://catalog.ldc.upenn.edu/LDC2025T11> was developed 
by LDC and contains English and Spanish text, video, and image data and 
annotations used for pre-evaluation research and system development during 
Phase 1 of the DARPA KAIROS program. KAIROS Quizlets were a series of narrowly 
defined tasks designed to explore specific evaluation objectives enabling 
KAIROS system developers to exercise individual system components on a small 
data set prior to the full program evaluation. This corpus contains the 
complete set of Quizlet data used in Phase 1, which focused on two real-world 
complex events (CEs) within the Improvised Explosive Device bombing scenario: 
CE1001 (2018 Caracas drone attack) and CE1002 (Utah High School backpack 
bombing).

Source data was collected from the web; 30 root web pages were collected and 
processed, yielding 29 text data files, 216 image files and 5 video files. 
Annotation steps included labeling scenario-relevant events and relations for 
each document to develop a structured representation of temporally ordered 
events, relations, and arguments and generating a reference knowledge graph.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over 
Schemas) program aimed to build technology capable of understanding and 
reasoning about complex real-world events in order to provide actionable 
insights to end users. KAIROS systems utilized formal event representations in 
the form of schema libraries that specified the steps, preconditions, and 
constraints for an open set of complex events; schemas were then used in 
combination with event extraction to characterize and make predictions about 
real-world events in a large multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: l...@ldc.upenn.edu<mailto:l...@ldc.upenn.edu>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

_______________________________________________
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-le...@list.elra.info
