In this newsletter:
Fall 2024 LDC Data Scholarship program

New publications:
LORELEI Uyghur Incident Language Pack<https://catalog.ldc.upenn.edu/LDC2024T07>
Ravnursson Faroese Speech and Transcripts<https://catalog.ldc.upenn.edu/LDC2024S09>

________________________________
Fall 2024 LDC Data Scholarship program
Student applications for the Fall 2024 LDC Data Scholarship program are being 
accepted now through September 15, 2024. This program provides eligible 
students with no-cost access to LDC data. Students must complete an application 
consisting of a data use proposal and letter of support from their advisor. For 
application requirements and program rules, visit the LDC Data Scholarships 
page<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________

New publications:
LORELEI Uyghur Incident Language Pack<https://catalog.ldc.upenn.edu/LDC2024T07> 
was developed by LDC and comprises 28 million words of Uyghur monolingual 
text, 500,000 words of English monolingual text, 3.3 million words of parallel 
and comparable Uyghur-English text, and 200,000 words annotated for simple 
named entities and situation frames. It constitutes all of the text data, 
annotations, supplemental resources, and related software tools for the Uyghur 
language that were used in the DARPA LORELEI / LoReHLT 2016 
Evaluation<https://www.nist.gov/itl/iad/mig/lorehlt-evaluations>.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low-resource languages in 
the context of emergent situations. In the evaluation scenario, an unforeseen 
event triggered a need for humanitarian and logistical support in a region 
where the incident language had received little or no attention in NLP 
research. Evaluation participants provided NLP solutions, including information 
extraction and machine translation, with limited resources and limited 
development time.

Data was collected from news, social networks, weblogs, newsgroups, discussion 
forums, and reference material. Named entity annotation identified entities to 
be detected by systems for scoring purposes. Situation frame analysis was 
designed to extract basic information about needs and relevant issues for 
planning a disaster response effort.

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

Ravnursson Faroese Speech and 
Transcripts<https://catalog.ldc.upenn.edu/LDC2024S09> contains 109 hours of 
Faroese prompted speech from 433 speakers (249 female, 184 male), along with 
corresponding transcripts and speaker metadata. It is an extract from the Basic Language 
Resource Kit 1.0 (BLARK 
1.0)<https://mtd.setur.fo/en/resource/ravnur-blark-1-0/> developed by the Faroe 
Islands' Ravnur Project<https://mtd.setur.fo/en/>.

Speech data was collected in 2022. Speakers from all major dialect areas in the 
Faroe Islands in three age groups -- 15-35, 36-60, and 61+ years -- read texts 
that included a word list, a phrase list, closed vocabulary readings, and short 
texts. Recordings also contain spontaneous speech. Orthographic transcripts are 
included.

2024 members can access this corpus through their LDC accounts provided they 
have submitted a completed copy of the special license agreement. Non-members 
may license this data at no cost.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104





_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
