In this newsletter:
LDC 2023 membership discounts now available
Approaching deadline for Spring 2023 data scholarship applications
30th Anniversary Highlight: AMR
________________________________
New publications:
CAMIO Transcription Languages<https://catalog.ldc.upenn.edu/LDC2022T07>
Global TIMIT Thai<https://catalog.ldc.upenn.edu/LDC2022S13>
Third DIHARD Challenge Evaluation<https://catalog.ldc.upenn.edu/LDC2022S14>

LDC 2023 membership discounts now available
Now through March 1, 2023, current 2022 members receive a 10% discount for 
renewing their membership, and new or returning organizations receive a 5% 
discount. Membership remains the most economical way to access current and past 
LDC releases. Consult Join LDC<https://www.ldc.upenn.edu/members/join-ldc> for 
details on membership options and benefits.

Approaching deadline for Spring 2023 data scholarship applications
Attention students: don't miss out on the chance to receive no-cost access to 
LDC data for your research. Applications for Spring 2023 data scholarships are 
due January 15, 2023. For more information on requirements and program rules, 
see LDC Data 
Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.

30th Anniversary Highlight: AMR
Abstract Meaning Representation (AMR) annotation was developed by LDC, 
SDL/Language Weaver, Inc., the University of Colorado's Computational Language 
and Educational Research group, and the Information Sciences Institute at the 
University of Southern California. It is a semantic representation language 
that captures "who is doing what to whom" in a sentence. Each sentence is 
paired with a graph that represents its whole-sentence meaning in a 
tree-structure. AMR utilizes PropBank frames, non-core semantic roles, 
within-sentence coreference, named entity annotation, modality, negation, 
questions, quantities, and so on to represent the semantic structure of a 
sentence largely independent of its syntax.
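
As an illustration of the "who is doing what to whom" idea, the sketch below encodes the sentence "The boy wants to go", a standard textbook AMR example, as a small graph in Python. It is a minimal sketch and not part of any LDC release or tool: the triple layout, the variable names (w, b, g), and the helper functions are assumptions made for this example, and the PropBank sense numbers are shown only for illustration. The variable b is used twice, once as the one who wants and once as the one who goes, which is why the representation is a graph even though it is written in a tree-like notation.

```python
from collections import Counter

# Hand-written AMR for "The boy wants to go" in PENMAN notation (illustrative only).
PENMAN = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))"

# The same graph as (source, role, target) triples. This layout is an
# assumption for the sketch, not an LDC file format.
TRIPLES = [
    ("w", ":instance", "want-01"),
    ("b", ":instance", "boy"),
    ("g", ":instance", "go-01"),
    ("w", ":ARG0", "b"),   # the boy is the wanter
    ("w", ":ARG1", "g"),   # the going is the thing wanted
    ("g", ":ARG0", "b"),   # the boy is also the goer
]


def concepts(triples):
    """Map each variable to the concept it is an instance of."""
    return {src: tgt for src, role, tgt in triples if role == ":instance"}


def reentrant_variables(triples):
    """Return variables that are the target of more than one relation."""
    variables = set(concepts(triples))
    incoming = Counter(
        tgt for _, role, tgt in triples
        if role != ":instance" and tgt in variables
    )
    return sorted(v for v, count in incoming.items() if count > 1)


def who_does_what(triples):
    """List (agent concept, event concept) pairs taken from the :ARG0 edges."""
    names = concepts(triples)
    return [(names[tgt], names[src])
            for src, role, tgt in triples if role == ":ARG0"]


if __name__ == "__main__":
    print(PENMAN)
    print(who_does_what(TRIPLES))
    print(reentrant_variables(TRIPLES))
```

Calling who_does_what(TRIPLES) pairs the boy with both the wanting and the going events, and reentrant_variables(TRIPLES) reports the shared variable b.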

LDC's Catalog contains three cumulative English AMR publications: Release 1.0 
(LDC2014T12<https://catalog.ldc.upenn.edu/LDC2014T12>), Release 2.0 
(LDC2017T10<https://catalog.ldc.upenn.edu/LDC2017T10>), and Release 3.0 
(LDC2020T02<https://catalog.ldc.upenn.edu/LDC2020T02>). The combined result in 
AMR 3.0 is a semantic treebank of 59,255 English natural language 
sentences from broadcast conversations, newswire, weblogs, web discussion 
forums, fiction, and web text, including multi-sentence annotations.

LDC has also published Chinese Abstract Meaning Representation 1.0 
(LDC2019T07<https://catalog.ldc.upenn.edu/LDC2019T07>) and 2.0 
(LDC2021T13<https://catalog.ldc.upenn.edu/LDC2021T13>), developed by Brandeis 
University and Nanjing Normal University. These corpora contain AMR annotations 
for approximately 20,000 sentences from Chinese Treebank 8.0 
(LDC2013T21<https://catalog.ldc.upenn.edu/LDC2013T21>). Chinese AMR follows the 
basic principles developed for English, making adaptations where necessary to 
accommodate Chinese phenomena.

Abstract Meaning Representation 2.0 - Four Translations 
(LDC2020T07<https://catalog.ldc.upenn.edu/LDC2020T07>), developed by the 
University of Edinburgh, School of Informatics, consists of Spanish, German, 
Italian, and Mandarin Chinese translations of a subset of sentences from AMR 
2.0.

Visit LDC's Catalog<https://catalog.ldc.upenn.edu/> for more details about 
these publications.
________________________________
New publications:
CAMIO Transcription Languages<https://catalog.ldc.upenn.edu/LDC2022T07> was 
developed by LDC and contains nearly 70,000 images of machine printed text with 
corresponding annotations and transcripts in 13 languages: Arabic, Chinese, 
English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, 
and Vietnamese. This corpus is a subset of data created for a broader effort to 
support the development and evaluation of optical character recognition and 
related technologies for 35 languages across 24 unique script types.

Most images were annotated for text localization, resulting in over 2.3M 
line-level bounding boxes; 1250 images per language were also annotated with 
orthographic transcriptions of each line plus specification of reading order, 
yielding over 2.4M tokens of transcribed text. The resulting annotations are 
represented in an XML output format defined for this corpus. Data for each 
language is partitioned into test, train, and validation sets.

2022 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.
*
Global TIMIT Thai<https://catalog.ldc.upenn.edu/LDC2022S13> consists of 12 
hours of read speech and time-aligned transcripts in Standard Thai from 50 
speakers (33 female, 17 male) reading 120 sentences selected from the Thai 
National Corpus<https://www.arts.chula.ac.th/ling/tnc/>, the Thai Junior 
Encyclopedia<https://www.au.edu/royal-activities/the-thai-encyclopedia-for-youth-project.html>,
and Thai Wikipedia, for a total of 6000 utterances. Data was collected in 
2016. Speakers were recruited in the Bangkok metropolitan area; they were 
native Thais, fluent in Standard Thai, and literate.

This data set was developed as part of LDC's Global TIMIT project, which aims to 
create a series of corpora in a variety of languages with key features similar 
to those of the original TIMIT Acoustic-Phonetic Continuous Speech Corpus 
(LDC93S1)<https://catalog.ldc.upenn.edu/LDC93S1>, which was designed for 
acoustic-phonetic studies and for the development and evaluation of automatic 
speech recognition systems.

2022 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.
*
Third DIHARD Challenge Evaluation<https://catalog.ldc.upenn.edu/LDC2022S14> was 
developed by LDC and contains 33 hours of English and Chinese speech data along 
with corresponding annotations used in support of the Third DIHARD 
Challenge<https://dihardchallenge.github.io/dihard3>.

The Third DIHARD development and evaluation sets were drawn from diverse 
sources including monologues, map task dialogues, broadcast interviews, 
sociolinguistic interviews, meeting speech, speech in restaurants, clinical 
recordings, and amateur web videos. Annotations include diarization and 
segmentation.

2022 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options; or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

