[Corpora-List] September 2025 Newsletter - LDC

Penn LDC via Corpora Mon, 15 Sep 2025 09:23:20 -0700

In this newsletter:
LDC data and commercial technology development

New publications:
Mixer 7 English Speech<https://catalog.ldc.upenn.edu/LDC2025S08>
AIDA Scenario 1 Evaluation Topic Source Data, Annotation and 
Assessment<https://catalog.ldc.upenn.edu/LDC2025T13>
LORELEI Hindi Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2025T12>


________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite 
for obtaining a commercial license to almost all LDC databases. Non-member 
organizations, including non-member for-profit organizations, cannot use LDC 
data to develop or test products for commercialization, nor can they use LDC 
data in any commercial product or for any commercial purpose. LDC data users 
should consult corpus-specific license agreements for limitations on the use of 
certain corpora. Visit the 
Licensing<https://www.ldc.upenn.edu/data-management/using/licensing> page for 
further information.
________________________________

New publications:
Mixer 7 English Speech<https://catalog.ldc.upenn.edu/LDC2025S08> was developed 
by LDC and contains 12,321 hours of audio recordings of interviews, transcript 
readings, and conversational telephone speech involving 222 distinct English 
speakers. This material was collected by LDC in 2010-2011 as part of the Mixer 
project, and the recordings were used in the 2012 NIST SRE test set.

Recruited speakers were connected through a robot operator to carry on casual 
conversations on a pre-set topic lasting up to 10 minutes. Participants also 
visited LDC's Human Subjects Collection Lab equipped with a 14-microphone array 
where they participated in interviews and transcript readings, and conducted 
telephone calls under varying conditions. Selected speaker metadata was also 
collected.

2025 members can access this corpus through their LDC accounts. This corpus is 
a Members-Only release and is not available for non-member licensing. Contact 
l...@ldc.upenn.edu<mailto:l...@ldc.upenn.edu> for information about membership.

*

AIDA Scenario 1 Evaluation Topic Source Data, Annotation and 
Assessment<https://catalog.ldc.upenn.edu/LDC2025T13> was developed by LDC and 
consists of English, Russian, and Ukrainian web documents (text, video, 
image), annotations, and assessments used in the AIDA Phase 1 pilot and final 
evaluations. The Phase 1 scenario focused on political relations between Russia 
and Ukraine in the 2010s. The material in this corpus covers the following 
events: Suspicious Deaths and Murders in Ukraine (January-April 2015); Odessa 
Tragedy (May 2, 2014); and Siege of Sloviansk and Battle of Kramatorsk 
(April-July 2014).

The corpus contains 10,522 documents, annotations for 386 of those documents, 
and assessment results covering 77,965 responses in 1,525 of those documents. 
Annotations were performed in three steps: (1) within-document labels for 
scenario-related entities, relations, and events; (2) coreference annotation 
across documents by linking information elements to a knowledge base; and (3) 
indications of any relationship between labeled events/relations and hypotheses 
about the scenario. In the assessment phase, LDC annotators reviewed and judged 
system response files to provide evaluation organizers with a means for scoring 
submissions. Assessment tasks included zero-hop assessment, class-based 
assessment, graph assessment, and hypothesis assessment.

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed 
to develop a multi-hypothesis semantic engine to generate explicit alternative 
interpretations of events, situations, and trends from a variety of 
unstructured sources. LDC supported AIDA by collecting, creating, and 
annotating multimodal linguistic resources in multiple languages.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

LORELEI Hindi Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2025T12> contains over 26 million words 
of Hindi monolingual text, 363,000 words of which were translated into English, 
1.07 million words of found Hindi-English parallel text, and 118,000 Hindi 
words translated from English data. Approximately 103,000 words were annotated 
for simple named entities and over 25,000 words were annotated for full entity 
(including nominals and pronouns), entity linking, and situation frames 
(identifying entities, needs and issues). Data was collected from discussion 
forums, news, reference, social network, and weblog sources.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low resource languages in 
the context of emergent situations. Representative languages were selected to 
provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as 
LORELEI Entity Detection and Linking Knowledge Base 
(LDC2020T10)<https://catalog.ldc.upenn.edu/LDC2020T10>.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: l...@ldc.upenn.edu<mailto:l...@ldc.upenn.edu>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

_______________________________________________
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-le...@list.elra.info
