In this newsletter:
LanguageArc featured in Babel magazine
Fall 2023 LDC Data Scholarship Program

New publications:
Mixer 7 Spanish Speech<https://catalog.ldc.upenn.edu/LDC2023S04>
LORELEI Thai Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2023T08>
________________________________
LanguageArc featured in Babel magazine
The May 2023 edition of 
Babel<https://cloud.3dissue.com/18743/41457/106040/BabelNo43/index.html> (The 
Language Magazine) features an article about LDC's citizen science portal 
LanguageArc<https://languagearc.com/> (Language Analysis Research Community) 
and the diverse projects available there that utilize a variety of novel 
incentives to supplement traditional methods of creating data resources. 
Consider LanguageArc for your next collection project. Note: a subscription is 
necessary to view the article.

Fall 2023 LDC Data Scholarship Program
Student applications for the Fall 2023 LDC Data Scholarship program are being 
accepted now through September 15, 2023. This program provides eligible 
students with no-cost access to LDC data. Students must complete an application 
consisting of a data use proposal and letter of support from their advisor. For 
application requirements and program rules, visit the LDC Data Scholarships 
page<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________
New publications:
Mixer 7 Spanish Speech<https://catalog.ldc.upenn.edu/LDC2023S04> was developed 
by LDC and contains 9,600 hours of audio recordings of interviews, transcript 
readings, and conversational telephone speech involving 191 distinct native 
Spanish speakers. This material was collected by LDC in 2011-2012 as part of 
the Mixer project, and the recordings were used in the 2012 NIST SRE test set.

Recruited speakers were connected through a robot operator to carry on casual 
conversations on a pre-set topic lasting up to 10 minutes. Participants also 
visited LDC's human subjects collection lab equipped with a 14-microphone array 
where they participated in interviews and transcript readings and conducted up 
to 3 telephone calls under varying conditions. Selected speaker metadata was 
also collected.

2023 members can access this corpus through their LDC accounts. This corpus is 
a members-only release and is not available for non-member licensing. Contact 
[email protected]<mailto:[email protected]> for information about membership.
*
LORELEI Thai Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2023T08> comprises over 39 million 
words of Thai monolingual text, 2.85 million words of found Thai-English 
parallel text, and 141,000 Thai words translated from English data. Over 
186,000 words were annotated for named entities and more than 25,000 words were 
annotated for entity discovery and linking and situation frames (identifying 
entities, needs, and issues). Data was collected from discussion forums, news, 
reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low resource languages in 
the context of emergent situations. Representative languages were selected to 
provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as 
LORELEI Entity Detection and Linking Knowledge Base 
(LDC2020T10)<https://catalog.ldc.upenn.edu/LDC2020T10>.

2023 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.



