In this newsletter:
LDC at LREC-COLING 2024

New publications:
Call My Net 1<https://catalog.ldc.upenn.edu/LDC2024S05>
Automatic Content Extraction for 
Portuguese<https://catalog.ldc.upenn.edu/LDC2024T05>
________________________________
LDC at LREC-COLING 2024
LDC will be exhibiting at LREC-COLING 2024<https://lrec-coling-2024.org/>, 
hosted by the European Language Resources Association (ELRA) and the 
International Committee on Computational Linguistics (ICCL), May 20-25 in Turin, 
Italy. Stop by our table to learn more about recent developments at the 
Consortium and the latest publications.

LDC staff members will also be presenting current work on topics including 
Spanless Event Annotation for Corpus-Wide Complex Event Understanding; Schema 
Learning Corpus: Data and Annotation Focused on Complex Events; and KoFREN: 
Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech 
Corpora.

LDC will post conference updates via social media. We look forward to seeing 
you in Italy!
________________________________

New publications:
Call My Net 1<https://catalog.ldc.upenn.edu/LDC2024S05> was developed by LDC 
and contains 364 hours of conversational telephone speech in four languages 
(Tagalog, Cebuano, Cantonese, and Mandarin) collected in 2015 from 221 native 
speakers located in the Philippines and China along with metadata and speaker 
demographic information. Recordings and data from this collection were used to 
support the NIST 2016 Speaker Recognition 
Evaluation<https://www.nist.gov/itl/iad/mig/speaker-recognition-evaluation-2016>.

Speakers made 10 telephone calls each to people within their existing social 
networks, using different handsets and under a variety of noise conditions. 
Speakers were connected through a robot operator to carry on casual 
conversations on topics of their choice. All recordings were manually audited 
to confirm language and speaker requirements. The documentation for this 
release includes metadata about phone type, noise conditions, and call quality. 
Speaker demographic information on year of birth, sex, and native language is 
also included.

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

Automatic Content Extraction for 
Portuguese<https://catalog.ldc.upenn.edu/LDC2024T05> was developed at INESC TEC 
- Instituto de Engenharia de Sistemas e Computadores, Tecnologia e 
Ciência<https://www.inesctec.pt/en> and consists of automatic Brazilian 
Portuguese and European Portuguese translations of the English text and 
annotations in ACE 2005 Multilingual Training Corpus 
(LDC2006T06)<https://catalog.ldc.upenn.edu/LDC2006T06>.

ACE 2005 Multilingual Training Corpus was developed by LDC to support the 
Automatic Content Extraction 
(ACE)<https://www.ldc.upenn.edu/collaborations/past-projects/ace> program, 
specifically by providing training data for the 2005 technology evaluation. It 
contains 1,800 files of mixed genre text in Arabic, English, and Chinese 
annotated for entities, relations, and events. The objective of the ACE program 
was to develop automatic content extraction technology to support automatic 
processing of human language in text form. Text genres included newswire, 
broadcast news, broadcast conversation, weblog, discussion forums, and 
conversational telephone speech.

For this translation, the English data was partitioned into training, 
development, and test sets. The documents were split into sentences and each 
event mention was assigned to its sentence. Source sentences and their 
annotations were translated into Brazilian Portuguese using Google 
Translate<https://translate.google.com/> and into European Portuguese using 
DeepL Translate<https://www.deepl.com/en/translator>. An alignment algorithm 
and a parallel corpus word aligner were used to handle mismatches between 
translated annotations and their translated sentences.
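
As a rough illustration of the sentence-splitting and mention-assignment step described above, the following Python sketch maps event mentions to sentences by character offset. It is not the corpus authors' code: the naive punctuation-based splitter, the (start, end, label) mention format, the function names, and the sample text are all assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Sentence:
    start: int  # offset of the first character in the document
    end: int    # offset one past the last character
    text: str


def split_sentences(doc):
    """Naive splitter (assumption): a sentence ends at . ! or ?
    followed by whitespace, or at the end of the text."""
    pattern = r"\S.*?(?:[.!?](?=\s|$)|$)"
    return [Sentence(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, doc, flags=re.DOTALL)]


def assign_mentions(sentences, mentions):
    """Attach each (start, end, label) mention to the sentence that fully
    contains it, converting offsets to sentence-relative positions.
    Mentions that cross a sentence boundary are left unassigned."""
    assigned = {i: [] for i in range(len(sentences))}
    for start, end, label in mentions:
        for i, s in enumerate(sentences):
            if s.start <= start and end <= s.end:
                assigned[i].append((start - s.start, end - s.start, label))
                break
    return assigned


# Made-up sample text and mentions, not taken from the corpus.
doc = "Troops attacked the town. Two people died."
mentions = [(7, 15, "Attack"), (37, 41, "Die")]

sentences = split_sentences(doc)
for i, spans in assign_mentions(sentences, mentions).items():
    for start, end, label in spans:
        print(i, label, sentences[i].text[start:end])
```

Keeping offsets relative to the sentence means each sentence and its annotations can be handed to a translation step as one unit. The later alignment step mentioned above would still be needed, since translated mention text may not match its translated sentence exactly.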

2024 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104


