[Corpora-List] December 2024 Newsletter - LDC

Penn LDC via Corpora Mon, 16 Dec 2024 09:28:50 -0800

In this newsletter:
LDC 2025 membership discounts now available
Approaching deadline for Spring 2025 data scholarship applications
LDC closed for Winter Break December 25-January 1


New publications:
MATERIAL Farsi-English Language Pack<https://catalog.ldc.upenn.edu/LDC2024S13>
Abstract Meaning Representation 3.0 - Machine Translations<https://catalog.ldc.upenn.edu/LDC2024T11>

________________________________
LDC 2025 membership discounts now available
Now through March 3, 2025, current 2024 members receive a 10% discount for 
renewing their membership, and new or returning organizations receive a 5% 
discount. Membership remains the most economical way to access current and past 
LDC releases. Consult Join LDC<https://www.ldc.upenn.edu/members/join-ldc> for 
details on membership options and benefits.

Approaching deadline for Spring 2025 data scholarship applications
Attention students: don't miss out on the chance to receive no-cost access to 
LDC data for your research. Applications for Spring 2025 data scholarships are 
due January 15, 2025. For more information on requirements and program rules, 
see LDC Data 
Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.

LDC closed for Winter Break December 25-January 1
LDC will be closed from Wednesday, December 25, 2024, through Wednesday, January 
1, 2025, in accordance with the University of Pennsylvania Winter Break Policy. 
Our offices will reopen on Thursday, January 2, 2025. Requests received by the 
Membership Office during Winter Break will be processed when the office reopens.
________________________________

New publications:
MATERIAL Farsi-English Language Pack<https://catalog.ldc.upenn.edu/LDC2024S13> 
was developed by Appen<http://www.appen.com/> for the IARPA 
MATERIAL<https://www.iarpa.gov/index.php/research-programs/material> program 
and contains 61 hours of Farsi conversational telephone speech, transcripts, 
English translations, annotations, and queries. Calls were made using different 
telephones (e.g., mobile, landline) from a variety of environments. Transcripts 
cover approximately 30% of the speech files, and approximately 3% of the speech 
files were translated into English. This release also includes English queries 
and their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal 
of building cross-language information retrieval systems that find speech and 
text content using English search queries.

2024 members can access this corpus through their LDC accounts provided they 
have submitted a completed copy of the special license agreement. Non-members 
may license this data for a fee.

*

Abstract Meaning Representation 3.0 - Machine 
Translations<https://catalog.ldc.upenn.edu/LDC2024T11> was developed by the 
Center for Computational Linguistics at KU 
Leuven<https://www.arts.kuleuven.be/ling/ccl> in the 
HORIZON2020<https://research-and-innovation.ec.europa.eu/funding/funding-opportunities/funding-programmes-and-open-calls/horizon-2020_en> 
project SignON<https://cordis.europa.eu/project/id/101017255>. It is an 
automatic translation of a subset of sentences from Abstract Meaning 
Representation (AMR) Annotation Release 3.0 
(LDC2020T02)<https://catalog.ldc.upenn.edu/LDC2020T02> into Spanish, Irish 
Gaelic, and Dutch.

AMR 3.0 training, development, and test splits were translated using Google 
Translate<https://translate.google.com>. "Unsplit" directories were not 
translated and are not included in this release. Translations were not manually 
verified, but formal issues (such as unexpected new lines) were corrected, and 
special tokens and encoding issues were fixed with the Python tool 
ftfy.fix_text<https://ftfy.readthedocs.io/en/latest/>.
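
As a rough illustration of the kind of cleanup described above, the sketch 
below collapses stray line breaks inside a translated sentence and repairs one 
common encoding problem (UTF-8 text that was misread as Latin-1). It is not the 
pipeline used to build the corpus: the function names are hypothetical, and the 
repair step is a simplified standard-library stand-in for ftfy.fix_text, which 
handles many more cases.

```python
import re


def collapse_newlines(text: str) -> str:
    """Replace line breaks (and the whitespace around them) with one space."""
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    Simplified stand-in for ftfy.fix_text (assumption: only this one kind of
    damage is present). If the round trip fails, the text is returned as-is.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def clean_translation(text: str) -> str:
    """Hypothetical helper: fix line breaks first, then the encoding."""
    return repair_mojibake(collapse_newlines(text))


# "se\u00c3\u00b1or" is "señor" after its UTF-8 bytes were read as Latin-1.
print(clean_translation("El se\u00c3\u00b1or\nno vino."))
```

Text that is already correct makes the round trip fail and is returned 
unchanged, so the repair step is safe to run on every sentence.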

AMR 3.0 is a semantic treebank of over 59,000 English natural language 
sentences drawn from material collected by LDC, specifically, discussion forum 
text from the DARPA BOLT and DARPA DEFT programs, transcripts and English 
translations of Mandarin Chinese broadcast news programming, Wall Street 
Journal text, translated Xinhua news texts, various newswire texts from NIST 
OpenMT evaluations, and weblog data from the DARPA GALE program.

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
