In this newsletter:
Join LDC for membership year 2025
Spring 2025 data scholarship application deadline

New publications:
LORELEI Yoruba Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2024T10>
Samrómur Synthetic<https://catalog.ldc.upenn.edu/LDC2024S12>

________________________________
Join LDC for membership year 2025
It's time to renew your LDC membership for 2025. Current (2024) members who 
renew their membership before March 3, 2025, will receive a 10% discount. New 
or returning organizations will receive a 5% discount if they join the 
Consortium by March 3.

In addition to receiving new publications, current LDC members enjoy the 
benefit of licensing older data from our Catalog of 950+ holdings at reduced 
fees. Current-year for-profit members may use most data for commercial 
applications.

Plans for next year's publications are in progress. Among the expected releases 
are:

  *   Iraqi Arabic - English Lexical Database: a set of six interrelated tables 
(roots, lemmas, wordforms, multi-word expressions, English definitions, example 
phrases) presenting each Iraqi Arabic word in Arabic script and IPA format, a 
result of LDC's collaboration with Georgetown University Press to enhance and 
update three dialectal Arabic dictionaries
  *   AIDA topic source data and annotations: multimodal source data and 
annotations in multiple languages (Russian, English, Spanish) for information 
and entity extraction
  *   2015 NIST Language Recognition Evaluation Test Set: 164,000+ segments of 
conversational telephone speech and broadcast narrow band speech in six 
linguistic varieties (Arabic, Spanish, English, Chinese, Slavic, French) 
representing 20 languages, used in NIST's 2015 language recognition evaluation
  *   BOLT CALLFRIEND CALLHOME CTS audio, transcripts and translations: 
previously unpublished Chinese and Egyptian Arabic telephone conversations from 
the CALLFRIEND and CALLHOME collections, with transcripts and translations 
developed by LDC for the DARPA BOLT program
  *   Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from 
ancient and modern Chinese texts with syntactic annotation based on sentence 
constituent analysis, developed by Beijing Normal University and Peking 
University
  *   IARPA MATERIAL language packs: conversational telephone speech, 
transcripts, English translations, annotations, and queries in multiple 
languages (e.g., Georgian, Kazakh, Lithuanian)
  *   LORELEI: representative and incident language packs containing 
monolingual text, bi-text, translations, annotations, supplemental resources, 
and related tools in various languages (e.g., Hungarian, Hindi, Amharic, Somali)

For full descriptions of all LDC data sets, browse our 
Catalog<https://catalog.ldc.upenn.edu/>. Visit Join 
LDC<https://www.ldc.upenn.edu/members/join-ldc> for details on membership, user 
accounts and payment.

Spring 2025 data scholarship application deadline
Applications are being accepted through January 15, 2025, for the Spring 
2025 LDC data scholarship program, which provides university students with 
no-cost access to LDC data. Consult the LDC Data 
Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> 
page for more information about program rules and submission requirements.
________________________________

New publications:

LORELEI Yoruba Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2024T10> was developed by LDC and 
consists of approximately 7.2 million words of Yoruba monolingual text, 
127,000 Yoruba words translated from English data, and 810,000 words of 
Yoruba-English parallel text. Approximately 77,000 words were annotated for 
named entities, over 25,000 words received full entity annotation (including 
nominals and pronouns) and simple semantic annotation, and around 10,000 words 
were annotated for noun phrase chunking. Data was collected from discussion 
forums, news, reference material, social networks, and weblogs.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low resource languages in 
the context of emergent situations. Representative languages were selected to 
provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as 
LORELEI Entity Detection and Linking Knowledge Base 
(LDC2020T10)<https://catalog.ldc.upenn.edu/LDC2020T10>.

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

Samrómur Synthetic<https://catalog.ldc.upenn.edu/LDC2024S12> was developed by 
the Language and Voice Lab, Reykjavik University<https://lvl.ru.is/> and 
contains 72 hours of Icelandic synthetic speech, transcripts and metadata. 
Source sentences were extracted from the Samrómur 
platform<https://samromur.is>, which comprises texts and transcripts covering 
various genres. Text was processed through a text-to-speech system developed by 
Reykjavik University's Language and Voice Lab to generate speech files. 
Synthesized speech was created with 44 voices (22 male, 22 female) at four 
different speed rates for a total of 220 speakers and 62,700 utterances (with 
285 sentences/speaker).

2024 members can access this corpus through their LDC accounts provided they 
have submitted a completed copy of the special license agreement. Non-members 
may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]