[Corpora-List] December 2022 Newsletter - LDC

Penn LDC via Corpora Thu, 15 Dec 2022 10:33:28 -0800

In this newsletter:
LDC 2023 membership discounts now available
Approaching deadline for Spring 2023 data scholarship applications
30th Anniversary Highlight: AMR
________________________________
New publications:
CAMIO Transcription Languages<https://catalog.ldc.upenn.edu/LDC2022T07>
Global TIMIT Thai<https://catalog.ldc.upenn.edu/LDC2022S13>
Third DIHARD Challenge Evaluation<https://catalog.ldc.upenn.edu/LDC2022S14>

LDC 2023 membership discounts now available
Now through March 1, 2023, current 2022 members receive a 10% discount for
renewing their membership, and new or returning organizations receive a 5%
discount. Membership remains the most economical way to access current and past
LDC releases. Consult Join LDC<https://www.ldc.upenn.edu/members/join-ldc> for
details on membership options and benefits.

Approaching deadline for Spring 2023 data scholarship applications
Attention students: don't miss out on the chance to receive no-cost access to
LDC data for your research. Applications for Spring 2023 data scholarships are
due January 15, 2023. For more information on requirements and program rules,
see LDC Data
Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.

30th Anniversary Highlight: AMR
Abstract Meaning Representation (AMR) annotation was developed by LDC,
SDL/Language Weaver, Inc., the University of Colorado's Computational Language
and Educational Research group, and the Information Sciences Institute at the
University of Southern California. It is a semantic representation language
that captures "who is doing what to whom" in a sentence. Each sentence is
paired with a graph that represents its whole-sentence meaning in a
tree structure. AMR uses PropBank frames, non-core semantic roles,
within-sentence coreference, named entity annotation, modality, negation,
questions, quantities, and so on to represent the semantic structure of a
sentence largely independent of its syntax.
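
As a rough illustration of the "who is doing what to whom" idea, the sketch below encodes an AMR for the sentence "The boy wants to go", a commonly used example, as relation triples. The sentence, the triple layout, and the helper names (concepts, who_does_what, reentrant_variables) are illustrative assumptions and are not taken from this newsletter or from the LDC release format.

```python
# Illustrative sketch only: an AMR for "The boy wants to go" in PENMAN notation.
PENMAN = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))"

# The same AMR as (source variable, relation, target) triples.
TRIPLES = [
    ("w", ":instance", "want-01"),
    ("b", ":instance", "boy"),
    ("g", ":instance", "go-01"),
    ("w", ":ARG0", "b"),   # the boy is the one who wants
    ("w", ":ARG1", "g"),   # the going is what is wanted
    ("g", ":ARG0", "b"),   # the boy is also the one who goes
]


def concepts(triples):
    """Map each variable to its concept, e.g. 'w' -> 'want-01'."""
    return {src: tgt for src, rel, tgt in triples if rel == ":instance"}


def who_does_what(triples):
    """List (agent concept, event concept) pairs from :ARG0 relations."""
    names = concepts(triples)
    return [(names[tgt], names[src]) for src, rel, tgt in triples if rel == ":ARG0"]


def reentrant_variables(triples):
    """Return variables that are the target of more than one relation."""
    names = concepts(triples)
    counts = {}
    for src, rel, tgt in triples:
        if rel != ":instance" and tgt in names:
            counts[tgt] = counts.get(tgt, 0) + 1
    return sorted(var for var, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(who_does_what(TRIPLES))
    print(reentrant_variables(TRIPLES))
```

The variable b is reused under both want-01 and go-01. This is a small instance of within-sentence coreference, and it is why the representation is a graph even though it is written in a tree-like notation.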

LDC's Catalog contains three cumulative English AMR publications: Release 1.0
(LDC2014T12<https://catalog.ldc.upenn.edu/LDC2014T12>), Release 2.0
(LDC2017T10<https://catalog.ldc.upenn.edu/LDC2017T10>), and Release 3.0
(LDC2020T02<https://catalog.ldc.upenn.edu/LDC2020T02>). The combined result in
AMR 3.0 is a semantic treebank of roughly 59,255 English natural language
sentences from broadcast conversations, newswire, weblogs, web discussion
forums, fiction, and web text, and includes multi-sentence annotations.

LDC has also published Chinese Abstract Meaning Representation 1.0
(LDC2019T07<https://catalog.ldc.upenn.edu/LDC2019T07>) and 2.0
(LDC2021T13<https://catalog.ldc.upenn.edu/LDC2021T13>), developed by Brandeis
University and Nanjing Normal University. These corpora contain AMR annotations
for approximately 20,000 sentences from Chinese Treebank 8.0
(LDC2013T21<https://catalog.ldc.upenn.edu/LDC2013T21>). Chinese AMR follows the
basic principles developed for English, making adaptations where necessary to
accommodate Chinese phenomena.

Abstract Meaning Representation 2.0 - Four Translations
(LDC2020T07<https://catalog.ldc.upenn.edu/LDC2020T07>), developed by the
University of Edinburgh, School of Informatics, consists of Spanish, German,
Italian, and Mandarin Chinese translations of a subset of sentences from AMR
2.0.

Visit LDC's Catalog <https://catalog.ldc.upenn.edu/> for more details about
these publications.
________________________________
New publications:
CAMIO Transcription Languages<https://catalog.ldc.upenn.edu/LDC2022T07> was
developed by LDC and contains nearly 70,000 images of machine printed text with
corresponding annotations and transcripts in 13 languages: Arabic, Chinese,
English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu,
and Vietnamese. This corpus is a subset of data created for a broader effort to
support the development and evaluation of optical character recognition and
related technologies for 35 languages across 24 unique script types.

Most images were annotated for text localization, resulting in over 2.3M
line-level bounding boxes; 1250 images per language were also annotated with
orthographic transcriptions of each line plus specification of reading order,
yielding over 2.4M tokens of transcribed text. The resulting annotations are
represented in an XML output format defined for this corpus. Data for each
language is partitioned into training, validation, and test sets.

2022 members can access this corpus through their LDC accounts. Non-members may
license this data for a fee.
*
Global TIMIT Thai<https://catalog.ldc.upenn.edu/LDC2022S13> consists of 12
hours of read speech and time-aligned transcripts in Standard Thai from 50
speakers (33 female, 17 male) reading 120 sentences selected from the Thai
National Corpus<https://www.arts.chula.ac.th/ling/tnc/>, the Thai Junior
Encyclopedia<https://www.au.edu/royal-activities/the-thai-encyclopedia-for-youth-project.html>,
and Thai Wikipedia, for a total of 6000 utterances. Data was collected in
2016. Speakers were recruited in the Bangkok metropolitan area; they were
native Thais, fluent in Standard Thai, and literate.

This data set was developed as part of LDC's Global TIMIT project, which aims
to create a series of corpora in a variety of languages with key features
similar to those of the original TIMIT Acoustic-Phonetic Continuous Speech
Corpus (LDC93S1)<https://catalog.ldc.upenn.edu/LDC93S1>, which was designed for
acoustic-phonetic studies and for the development and evaluation of automatic
speech recognition systems.

2022 members can access this corpus through their LDC accounts. Non-members may
license this data for a fee.
*
Third DIHARD Challenge Evaluation<https://catalog.ldc.upenn.edu/LDC2022S14> was
developed by LDC and contains 33 hours of English and Chinese speech data along
with corresponding annotations used in support of the Third DIHARD
Challenge<https://dihardchallenge.github.io/dihard3>.

The Third DIHARD development and evaluation sets were drawn from diverse
sources including monologues, map task dialogues, broadcast interviews,
sociolinguistic interviews, meeting speech, speech in restaurants, clinical
recordings, and amateur web videos. Annotations include diarization and
segmentation.

2022 members can access this corpus through their LDC accounts. Non-members may
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to
"Receive Newsletter" under Account Options; or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
Philadelphia, PA 19104

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
