In this newsletter:
LDC/Penn receives US Dept of Education research grant
Membership year 2025 publication preview
Fall 2024 data scholarship recipients

New publications:
RST Continuity Corpus<https://catalog.ldc.upenn.edu/LDC2024T08>
MultiTACRED<https://catalog.ldc.upenn.edu/LDC2024T09>

________________________________
LDC/Penn receives US Dept of Education research grant
LDC and Penn's Graduate School of Education and Department of Computer and 
Information Science are part of a team that was recently awarded a $10 million 
grant from the US Department of 
Education<https://ies.ed.gov/funding/grantsearch/details.asp?ID=6066> to 
develop the Using Generative Artificial Intelligence for Reading R&D Center 
(U-GAIN Reading), which will explore using generative AI to improve elementary 
school reading instruction for English learners. Led by the education nonprofit 
Digital Promise, U-GAIN Reading will build on an existing research-based 
tutoring platform, Amira Learning, that is used by more than 1 million students 
each year. The LDC/Penn team will contribute expertise in computational 
linguistics, computer science, and learning analytics. An evaluation team at 
MDRC will measure learner outcomes both to improve the R&D and to benchmark its 
eventual impacts. Additional experts in the science of reading, ethics, and 
strategies for national impact will support the project's work. Data developed 
in the project will be shared with the community through the LDC Catalog.

Membership year 2025 publication preview
The 2025 membership year is approaching and plans for next year's publications 
are in progress. Among the expected releases are:

  *   Iraqi Arabic - English Lexical Database: a set of six interrelated 
tables (roots, lemmas, wordforms, multi-word expressions, English definitions, 
example phrases) presenting each Iraqi Arabic word in Arabic script and IPA 
format, a result of LDC's collaboration with Georgetown University Press to 
enhance and update three dialectal Arabic dictionaries
  *   AIDA topic source data and annotations: multimodal source data and 
annotations in multiple languages (Russian, English, Spanish) for information 
and entity extraction
  *   2015 NIST Language Recognition Evaluation Test Set: 164,000+ segments of 
conversational telephone speech and broadcast narrow band speech in six 
linguistic varieties (Arabic, Spanish, English, Chinese, Slavic, French) 
representing 20 languages, used in NIST's 2015 language recognition evaluation
  *   BOLT CALLFRIEND CALLHOME CTS Audio, Transcripts and Translations: 
previously unpublished Chinese and Egyptian Arabic telephone conversations from 
the CALLFRIEND and CALLHOME collections, with transcripts and translations 
developed by LDC for the DARPA BOLT program
  *   Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from 
ancient and modern Chinese texts with syntactic annotation based on sentence 
constituent analysis, developed by Beijing Normal University and Peking 
University
  *   IARPA MATERIAL language packs: conversational telephone speech, 
transcripts, English translations, annotations, and queries in multiple 
languages (e.g., Georgian, Kazakh, Lithuanian)
  *   LORELEI: representative and incident language packs containing 
monolingual text, bi-text, translations, annotations, supplemental resources, 
and related tools in various languages (e.g., Hungarian, Hindi, Amharic, Somali)

Check your inbox for more information about membership renewal.

Fall 2024 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2024 data scholarships:

Yomma Gamaleldin: Alexandria University (Egypt): Master's student, Computer and 
Systems Engineering Department. Yomma is awarded a copy of Qatari Corpus of 
Argumentative Writing (LDC2022T04) for her work in Arabic automated essay 
scoring.

Ahrane Mahaganapathy: Jaffna University (Sri Lanka): Master's student, 
Department of Computer Science. Ahrane is awarded copies of IARPA Babel Tamil 
Language Pack (LDC2017S13) and Multi-Language Telephone Speech 2011 - South 
Asian (LDC2017S14) for her work in Tamil speech-to-text systems.

Sivashanth Suthakar: Jaffna University (Sri Lanka): Master's student, 
Department of Computer Science. Sivashanth is awarded copies of CAMIO 
Transcription Languages (LDC2022T07) and LORELEI Tamil Representative Language 
Pack (LDC2023T03) for his work in Tamil OCR systems.

Oshan Yalegama: University of Moratuwa (Sri Lanka): BSc, Electronic and 
Telecommunication Engineering. Oshan is awarded copies of CSR-I (WSJ0) Complete 
(LDC93S6A) and TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) for 
his work in audio signal processing.

Samer Mohammed Yaseen: Sana'a University (Yemen): PhD candidate, Faculty of 
Computer and Information Technology. Samer is awarded a copy of Arabic Newswire 
Part 1 (LDC2001T55) for his work in Arabic information retrieval.
________________________________

New publications:
RST Continuity Corpus<https://catalog.ldc.upenn.edu/LDC2024T08> was developed 
at Åbo Akademi University and Humboldt-Universität zu Berlin and contains 
annotations for continuity dimensions added to RST Discourse Treebank 
(LDC2002T07)<https://catalog.ldc.upenn.edu/LDC2002T07>. RST Discourse Treebank 
is a collection of English news texts from the Penn 
Treebank<https://catalog.ldc.upenn.edu/LDC99T42> annotated for rhetorical 
relations under the RST (Rhetorical Structure Theory) framework. In RST 
Continuity Corpus, the relations are annotated for the seven continuity 
dimensions: time, space, reference, action, perspective, modality, and speech 
act. The relations are also annotated for polarity, order of segments, 
nuclearity, and context.

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

MultiTACRED<https://catalog.ldc.upenn.edu/LDC2024T09> was developed by the 
German Research Center for Artificial Intelligence (DFKI) Speech and Language 
Technology 
Lab<https://www.dfki.de/en/web/research/research-departments/speech-and-language-technology>
 and is a machine translation of TAC Relation Extraction Dataset 
(LDC2018T24)<https://catalog.ldc.upenn.edu/LDC2018T24> (TACRED) into twelve 
languages with projected entity annotations. TACRED is a large-scale relation 
extraction dataset containing 106,264 examples built over English newswire and 
web text used in the NIST TAC KBP English slot filling evaluations during the 
period 2009-2014. The training and evaluation data for the TAC KBP slot filling 
tasks was developed by the Linguistic Data Consortium.

TACRED training, development, and test splits were translated into Arabic, 
Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, 
Spanish, and Turkish using DeepL<https://www.deepl.com/> or Google 
Translate<https://translate.google.com>. The test split was back-translated 
into English to generate machine-translated English test data. TACRED 
annotations are specified by token offsets. For translation, tokens were 
concatenated with white space, and the entity offsets were converted into 
XML-style markers to denote the relation arguments.
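
For readers unfamiliar with this kind of preprocessing, the conversion described above can be sketched roughly as follows. This is only an illustration, not the DFKI code: the function name mark_arguments, the tag names H and T, and the sample sentence are hypothetical, and the spans are assumed to be inclusive, non-overlapping token index pairs.

```python
def mark_arguments(tokens, subj_span, obj_span, subj_tag="H", obj_tag="T"):
    """Join tokens with white space and wrap two token spans in XML-style markers.

    Assumptions (not taken from the corpus documentation): spans are
    (start, end) token indices with an inclusive end, the two spans do not
    overlap, and the tag names are placeholders.
    """
    pieces = list(tokens)  # copy so the caller's token list is not modified
    for (start, end), tag in ((subj_span, subj_tag), (obj_span, obj_tag)):
        pieces[start] = f"<{tag}>" + pieces[start]   # opening marker before first token
        pieces[end] = pieces[end] + f"</{tag}>"      # closing marker after last token
    return " ".join(pieces)


# Hypothetical example sentence, not drawn from TACRED.
example_tokens = ["Jane", "Doe", "works", "at", "Acme", "."]
marked = mark_arguments(example_tokens, subj_span=(0, 1), obj_span=(4, 4))
print(marked)  # <H>Jane Doe</H> works at <T>Acme</T> .
```

Carrying the markers inside the text is what lets the argument positions be recovered from the translated sentence, since token offsets in the source language no longer apply after translation.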

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104





