In this newsletter:
Renew your LDC membership today

New publications:
KASET - Kurmanji and Sorani Kurdish Speech and 
Transcripts<https://catalog.ldc.upenn.edu/LDC2024S01>
LORELEI Farsi Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2024T01>
________________________________
Renew your LDC membership today
The importance of curated resources for language-related education, research, 
and technology development drives LDC's mission to create them, to accept data 
contributions from researchers across the globe, and to broadly share such 
resources through the LDC Catalog. LDC members enjoy no-cost access to new 
corpora released annually, as well as the ability to license legacy data sets 
from among our 950+ holdings at reduced fees. Ensure that your data needs 
continue to be met by renewing your LDC membership or by joining the Consortium 
today.

Now through March 1, 2024, organizations that were members in 2023 receive a 
10% discount on 2024 membership, and new or returning organizations receive a 
5% discount. 
Membership remains the most economical way to access current and past LDC 
releases. Consult Join 
LDC<https://www.ldc.upenn.edu/communications/newsletter/january-2022-newsletter>
 for more details on membership options and benefits.
________________________________
New publications:
KASET - Kurmanji and Sorani Kurdish Speech and 
Transcripts<https://catalog.ldc.upenn.edu/LDC2024S01> consists of 147 hours of 
telephone conversations (289 recordings) and broadcast news (410 recordings) in 
two Kurdish dialects, Kurmanji Kurdish and Sorani Kurdish, along with 
transcripts covering 60 hours of those recordings. Kurdish is spoken primarily 
in Turkey, Iran, Iraq, and Syria. Sorani and Kurmanji are its two most widely 
spoken dialects.

The telephone speech was generated from calls by native Kurdish speakers in the 
United States to North American acquaintances in their social network. The 
broadcast news audio was collected from multiple streaming radio and television 
broadcast programs (narrowband and wideband audio), many of which contained a 
mix of Kurmanji and Sorani Kurdish. Native speaker auditors identified a 5-10 
minute span from each broadcast recording for transcription. Full telephone 
recordings that passed the native speaker audit were transcribed. This release 
includes speaker information, such as gender, year of birth, and language.

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

LORELEI Farsi Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2024T01> was developed by LDC and 
comprises approximately 250 million words of Farsi monolingual text, 120,000 
Farsi words translated from English data, and 751,000 words of found 
Farsi-English parallel text. Approximately 75,000 words were annotated for 
named entities and up to 22,000 words were annotated for entity discovery and 
linking and situation frames (identifying entities, needs, and issues). Data 
was collected from discussion forums, news, reference sources, social networks, 
and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low resource languages in 
the context of emergent situations. Representative languages were selected to 
provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as 
LORELEI Entity Detection and Linking Knowledge Base 
(LDC2020T10)<https://catalog.ldc.upenn.edu/LDC2020T10>.

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options, or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
