[Corpora-List] April 2024 Newsletter - LDC

Penn LDC via Corpora Mon, 15 Apr 2024 09:24:11 -0700

In this newsletter:
New publications:
LoReHLT Hausa Representative Language Pack
<https://catalog.ldc.upenn.edu/LDC2024T03>
AIDA Scenario 2 Practice Topic Source Data
<https://catalog.ldc.upenn.edu/LDC2024T04>


________________________________
New publications:
LoReHLT Hausa Representative Language Pack
<https://catalog.ldc.upenn.edu/LDC2024T03> was developed by LDC and comprises
approximately 4.4 million words of Hausa monolingual text, 86,000 Hausa words
translated from English data, and 30 minutes of Hausa audio recordings.
Approximately 96,000 words were annotated for named entities, and over 13,000
words received full entity annotation, including nominals and pronouns.
Noun-phrase chunking was applied to more than 7,400 words, and over 9,600 words
were labeled with simple semantic annotation. Topic annotation was applied to
the audio recordings. Data was collected from discussion forums, news,
reference material, social networks, amateur web audio recordings, and weblogs.

LoReHLT was a companion project of the DARPA LORELEI program. The LORELEI (Low 
Resource Languages for Emergent Incidents) program was concerned with building 
human language technology for low-resource languages in the context of emergent 
situations. Representative languages were selected to provide broad typological 
coverage.

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

AIDA Scenario 2 Practice Topic Source Data
<https://catalog.ldc.upenn.edu/LDC2024T04> was developed by LDC and comprises
1,500 root documents (text, image, and video) from English, 
Russian, and Spanish web sources. Each phase of the AIDA program centered on a 
specific scenario, or broad topic area, with related subtopics designated as 
either practice subtopics or evaluation subtopics. The Phase 2 scenario focused 
on the socioeconomic and political crisis in Venezuela since 2010. This corpus 
constitutes the full set of topic-focused documents for Phase 2 practice 
subtopics.

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed 
to develop a multi-hypothesis semantic engine to generate explicit alternative 
interpretations of events, situations and trends from a variety of unstructured 
sources. LDC supported AIDA by collecting, creating and annotating multimodal 
linguistic resources in multiple languages.

The knowledge base for entity detection and linking annotation for all AIDA 
Scenario 1 and 2 corpora is available separately as AIDA Scenario 1 and 2 
Reference Knowledge Base (LDC2023T10) <https://catalog.ldc.upenn.edu/LDC2023T10>.

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

