In this newsletter:
Renew your LDC membership today
30th Anniversary Highlight: CSR

New publications:
AIDA Ukrainian Broadcast and Telephone Speech Audio and 
Transcripts<https://catalog.ldc.upenn.edu/LDC2023S01>
LORELEI Swahili Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2023T01>
________________________________
Renew your LDC membership today
The importance of curated resources for language-related education, research, 
and technology development drives LDC's mission to create them, to accept data 
contributions from researchers across the globe, and to broadly share such 
resources through the LDC Catalog. LDC members enjoy no-cost access to new 
corpora released annually, as well as the ability to license legacy data sets 
from among our 925+ holdings at reduced fees. Ensure that your data needs 
continue to be met by renewing your LDC membership or by joining the Consortium 
today.

Now through March 1, 2023, organizations that were 2022 members receive a 10% 
discount on 2023 membership, and new or returning organizations receive a 5% 
discount. 
Membership remains the most economical way to access current and past LDC 
releases. Consult Join LDC<https://www.ldc.upenn.edu/members/join-ldc> for more 
details on membership options and benefits.

30th Anniversary Highlight: CSR
The CSR (continuous speech recognition) corpus series was developed in the 
early 1990s under DARPA's Spoken Language Program to support research on 
large-vocabulary CSR systems.

CSR-I (WSJ0) Complete (LDC93S6A)<https://catalog.ldc.upenn.edu/LDC93S6A> and 
CSR-II (WSJ1) Complete (LDC94S13A)<https://catalog.ldc.upenn.edu/LDC94S13A> 
contain speech from a machine-readable corpus of Wall Street Journal news text. 
They also include spontaneous dictation by journalists of hypothetical news 
articles as well as transcripts.

The text in CSR-I (WSJ0) was selected to fall within either a 5,000-word subset 
or a 20,000-word subset. Audio includes speaker-dependent and 
speaker-independent sections as well as sentences with verbalized and 
nonverbalized punctuation (Doddington, 
1992<https://aclanthology.org/H92-1074.pdf>). CSR-II features "Hub and Spoke" 
test sets that include a 5,000-word subset and a 64,000-word subset. Both data 
sets were collected using two microphones: a close-talking Sennheiser HMD414 
and a second microphone of varying type.

WSJ0 Cambridge Read News (LDC95S24)<https://catalog.ldc.upenn.edu/LDC95S24> was 
developed by Cambridge University and consists of native British English 
speakers reading CSR WSJ news text, specifically, sentences from the 5,000-word 
and 64,000-word subsets. All speakers also recorded a common set of 18 
adaptation sentences.

The CSR corpora continue to have value for the research community. CSR-I (WSJ0) 
target utterances were used in the CHiME2 and CHiME3 challenges, which focused 
on distant-microphone automatic speech recognition in real-world environments. 
CHiME2 WSJ0 (LDC2017S10)<https://catalog.ldc.upenn.edu/LDC2017S10> and CHiME2 
Grid (LDC2017S07)<https://catalog.ldc.upenn.edu/LDC2017S07> each contain over 
120 hours of English speech from a noisy living room environment. CHiME3 
(LDC2017S24)<https://catalog.ldc.upenn.edu/LDC2017S24> consists of 342 hours of 
English speech and transcripts from noisy environments and 50 hours of noisy 
environment audio.

CSR-I target utterances were also used in the Distant-Speech Interaction for 
Robust Home Applications (DIRHA) Project, which addressed natural spontaneous 
speech interaction with distant microphones in a domestic environment. DIRHA 
English WSJ Audio (LDC2018S01)<https://catalog.ldc.upenn.edu/LDC2018S01> is 
comprised of approximately 85 hours of real and simulated read speech from 
native American English speakers in an apartment setting with typical domestic 
background noises and inter/intra-room reverberation effects.

Multi-Channel WSJ Audio (LDC2014S03)<https://catalog.ldc.upenn.edu/LDC2014S03>, 
designed to address the challenges of speech recognition in meetings, contains 
100 hours of audio from British English speakers reading sentences from WSJ0 
Cambridge Read News. There were three recording scenarios: a single stationary 
speaker, two stationary overlapping speakers, and a single moving speaker.

All CSR corpora and their related data sets are available for licensing by 
Consortium members and non-members. Visit Obtaining 
Data<https://www.ldc.upenn.edu/language-resources/data/obtaining> for more 
information.
________________________________
New publications:

AIDA Ukrainian Broadcast and Telephone Speech Audio and 
Transcripts<https://catalog.ldc.upenn.edu/LDC2023S01> is comprised of 
approximately 156 hours of Ukrainian conversational telephone speech and 
broadcast news audio with 1.2 million words of corresponding orthographic 
transcripts.

The news audio data was taken from 87 recordings broadcast by various Ukrainian 
sources. The telephone speech was generated from telephone calls by native 
Ukrainian speakers to acquaintances in their social network. Native Ukrainian 
speakers manually segmented the data into sentence-level units as part of the 
transcription process.

The broadcast recordings and transcripts were produced by LDC to support the 
DARPA AIDA (Active Interpretation of Disparate Alternatives) program, which 
aimed to develop a multi-hypothesis semantic engine to generate explicit 
alternative interpretations of events, situations, and trends from a variety of 
unstructured sources. The telephone speech audio recordings were collected by 
LDC to support the NIST 2011 Language Recognition 
Evaluation<https://www.nist.gov/itl/iad/mig/2011-language-recognition-evaluation> and 
are also contained in Multi-Language Conversational Telephone Speech 2011 - 
Slavic Group LDC2016S11<https://catalog.ldc.upenn.edu/LDC2016S11>.

2023 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.
*
LORELEI Swahili Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2023T01> was developed by LDC and is 
comprised of approximately 4.3 million words of Swahili monolingual text, 
90,000 Swahili words translated from English data, and 545,000 words of found 
Swahili-English parallel text. Approximately 100,000 words were annotated for 
named entities and up to 26,000 words were annotated for entity discovery and 
linking and situation frames (identifying entities, needs and issues). Data was 
collected from discussion forums, news, reference sources, social networks, and 
weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low resource languages in 
the context of emergent situations. Representative languages were selected to 
provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as 
LORELEI Entity Detection and Linking Knowledge Base 
(LDC2020T10)<https://catalog.ldc.upenn.edu/LDC2020T10>.

2023 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
