In this newsletter:
LDC data and commercial technology development

New publications:
L2-KSU Native and Non-Native Arabic 
Speech<https://catalog.ldc.upenn.edu/LDC2024S11>
MATERIAL Somali-English Language Pack<https://catalog.ldc.upenn.edu/LDC2024S10>

________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite 
for obtaining a commercial license to almost all LDC databases. Non-member 
organizations, including non-member for-profit organizations, cannot use LDC 
data to develop or test products for commercialization, nor can they use LDC 
data in any commercial product, or for any commercial purpose. LDC data users 
should consult corpus-specific license agreements for limitations on the use of 
certain corpora. Visit the 
Licensing<https://www.ldc.upenn.edu/data-management/using/licensing> page for 
further information.
________________________________

New publications:
L2-KSU Native and Non-Native Arabic 
Speech<https://catalog.ldc.upenn.edu/LDC2024S11> was developed by King Saud 
University<http://ksu.edu.sa/en/> (KSU) and contains approximately six hours of 
Modern Standard Arabic read speech from 80 subjects, along with transcripts and 
speaker metadata.

The speech data was collected in 2022 from 40 native and 40 non-native 
speakers. Native speakers were from Saudi Arabia, Egypt, and Palestine, and 
provided audio recordings through the crowdsourcing platform 
Khamsat<https://khamsat.com/>. Non-native speakers were Central and West 
African students enrolled in KSU's Arabic Linguistics Institute; they provided 
speech recordings on site. All subjects read a series of ten sentences, 
repeating each sentence multiple times.

2024 members can access this corpus through their LDC accounts provided they 
have submitted a completed copy of the special license agreement. Non-members 
may license this data for a fee.

*

MATERIAL Somali-English Language Pack<https://catalog.ldc.upenn.edu/LDC2024S10> 
was developed by Appen<http://www.appen.com/> for the IARPA (Intelligence 
Advanced Research Projects Activity) 
MATERIAL<https://www.iarpa.gov/index.php/research-programs/material> (Machine 
Translation for English Retrieval of Information in Any Language) program. It 
contains 80 hours of Somali conversational telephone speech, transcripts, 
English translations, annotations, and queries.

Calls were made using different telephones (e.g., mobile, landline) from a 
variety of environments. Transcripts cover approximately 10% of the speech 
files, and approximately 4% of the speech files were translated into English. 
This release also includes domain annotations, English queries, and their 
relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal 
of building cross-language information retrieval systems that find speech and 
text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they 
have submitted a completed copy of the special license agreement. Non-members 
may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104




