In this newsletter:
LDC membership discounts expire March 2
Spring 2026 data scholarship recipient

New publications:
2022 NIST Language Recognition Evaluation Test and Development 
Sets<https://catalog.ldc.upenn.edu/LDC2026S03>
KAIROS Schema Learning Background Source 
Data<https://catalog.ldc.upenn.edu/LDC2026T02>
LORELEI Russian Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2026T01>

________________________________
LDC membership discounts expire March 2
Time is running out to save on 2026 membership fees. Renew your LDC membership, 
rejoin the Consortium, or become a new member by March 2 to receive a 10% 
discount. For more information on membership benefits and options, visit Join 
LDC<https://www.ldc.upenn.edu/members/join-ldc>.

Spring 2026 data scholarship recipient
Congratulations to the recipient of LDC's Spring 2026 data scholarship:

Doma Akshitha Reddy: Chaitanya Bharathi Institute of Technology (India): 
Bachelor of Engineering, Information Technology. Doma is awarded copies of 
TIMIT Acoustic-Phonetic Continuous Speech Corpus and The CMU Kids Corpus for 
their work in child speech.

Since 2010, LDC has awarded scholarships to successful student applicants twice 
each year. To date, more than 242 corpora have been distributed to 162 students 
across 38 countries. We proudly celebrate their achievements and the 
contributions their research has made to the broader community.

The next round of applications will be accepted in September 2026. For 
information about the program, visit the Data Scholarships 
page<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________

New publications:
2022 NIST Language Recognition Evaluation Test and Development 
Sets<https://catalog.ldc.upenn.edu/LDC2026S03> was developed by LDC and 
NIST<https://www.nist.gov/> and contains the test and development data, 
metadata, answer keys, and documentation for the 2022 NIST Language Recognition 
Evaluation (LRE22). The source data comprises 222 hours of conversational 
telephone speech (CTS) and broadcast narrowband speech (BNBS) in 14 languages: 
Afrikaans, Tunisian Arabic, Algerian Arabic, Libyan Arabic, South African 
English, Indian-accented South African English, North African French, Ndebele, 
Oromo, Tigrinya, Tsonga, Venda, Xhosa, and Zulu.

For the CTS collections, a small number of native speakers made single calls to 
multiple individuals in their social network. Calls lasted 8-15 minutes; 
speakers were free to discuss any topic. The BNBS data was collected from 
streaming radio programming, focused on broadcasts that included narrowband 
speech (e.g., call-ins to a talk show). Portions of the CTS callee call sides 
and portions of each broadcast recording were manually audited by native 
speakers to verify language and quality.

LRE22<https://www.nist.gov/publications/2022-nist-language-recognition-evaluation> 
emphasized language recognition for African languages, including low-resource 
languages, and expanded the range of test segment durations. Further 
information about the 2022 evaluation can be found in the 2022 NIST Language 
Recognition Evaluation Plan<https://lre.nist.gov/uassets/3>.

2026 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

KAIROS Schema Learning Background Source 
Data<https://catalog.ldc.upenn.edu/LDC2026T02> was developed by LDC and 
includes 14,000 English and Spanish documents representing text, audio, video, 
image, and multimedia resources collected during the DARPA KAIROS program as 
supplemental background source data for the KAIROS Schema Learning Corpus 
(SLC). The purpose of the supplemental collection was to increase the amount of 
English and Spanish data with multimedia components for schema learning and to 
add domains not well represented in existing Spanish data. The supplemental 
data in this release includes material from the business and logistics domains, 
instructional documents, and multimedia news.

The complete set of SLC background source data (including the data in this 
publication) totaled 16.2 million English, Russian, and Spanish documents and 
more than 125,000 audio, video, image, or multimedia resources. A large portion 
of that data was drawn from pre-existing LDC datasets.

The SLC and KAIROS Schema Learning Complex Event Annotation 
(LDC2025T07)<https://catalog.ldc.upenn.edu/LDC2025T07>, containing English and 
Spanish text, audio, video, and image material labeled for 93 real-world 
complex events, constitute the data used by KAIROS system developers for schema 
learning.

KAIROS systems utilized formal event representations in the form of schema 
libraries that specified the steps, preconditions, and constraints for an open 
set of complex events; schemas were then used in combination with event 
extraction to characterize and make predictions about real-world events in a 
large multilingual, multimedia corpus.

2026 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

LORELEI Russian Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2026T01> contains over 1.26 billion words 
of Russian monolingual text, 360,000 words of which were translated into 
English, 3 million words of found Russian-English parallel text, and 87,000 
Russian words translated from English data. Approximately 83,000 words were 
annotated for simple named entities, around 26,000 words were annotated for 
full entity (including nominals and pronouns), entity linking, and situation 
frames (identifying entities, needs, and issues), and nearly 9,000 words were 
covered by noun phrase chunking annotation. Data was collected from discussion 
forums, news, reference sources, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low-resource languages in 
the context of emergent situations. Representative languages were selected to 
provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as 
LORELEI Entity Detection and Linking Knowledge Base 
(LDC2020T10)<https://catalog.ldc.upenn.edu/LDC2020T10>.

2026 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104




