[Corpora-List] June 2022 Newsletter - LDC

Penn LDC via Corpora Wed, 15 Jun 2022 08:03:50 -0700

In this newsletter:
LDC at LREC 2022
LDC data and commercial technology development
30th Anniversary Highlight: TIMIT


New publication:
Second DIHARD Challenge Evaluation - Eleven 
Sources<https://catalog.ldc.upenn.edu/LDC2022S06>
________________________________
LDC at LREC 2022
LDC will attend the 13th Language Resources and Evaluation Conference 
(LREC2022<https://lrec2022.lrec-conf.org/en/>), hosted by ELRA, the European 
Language Resources Association, in Marseille, France, June 20-25, 2022. Several 
LDC staff members will be presenting current work on topics including 
WeCanTalk: A New Multi-language, Multi-modal Resource for Speaker Recognition; 
Reflections on 30 Years of Language Resource Development and Sharing; A Study 
in Contradiction: Data and Annotation for AIDA Focusing on Informational 
Conflict in Russia-Ukraine Relations; Data Protection, Privacy and US 
Regulation; BeSt: The Belief and Sentiment Corpus; and more.

Stay tuned for specific announcements on LDC's social media pages regarding 
presentation times and locations. Following the conference, LDC's presented 
papers and posters will be available on the Papers 
Page.<https://www.ldc.upenn.edu/language-resources/papers/ldc-papers>

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite 
for obtaining a commercial license to almost all LDC databases. Non-member 
organizations, including non-member for-profit organizations, cannot use LDC 
data to develop or test products for commercialization, nor can they use LDC 
data in any commercial product or for any commercial purpose. LDC data users 
should consult corpus-specific license agreements for limitations on the use of 
certain corpora. Visit the 
Licensing<https://www.ldc.upenn.edu/data-management/using/licensing> page for 
further information.

30th Anniversary Highlight: TIMIT
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is another of the classic 
releases in LDC's Catalog. Designed for the acquisition of acoustic-phonetic 
knowledge and for the development and evaluation of automatic speech 
recognition systems, it contains recordings of 630 American English speakers 
each reading 10 phonetically rich sentences, for a total of 6300 utterances 
comprising 2342 distinct sentences. Data collection and annotation were a joint 
effort by Texas Instruments, the Massachusetts Institute of Technology, and SRI 
International, and the data release was prepared by NIST (National Institute of 
Standards and Technology).

TIMIT was among the first publications that appeared with the launch of LDC's 
catalog in 1993. It remains one of the Consortium's top ten distributed corpora 
and may be the single most widely used speech database. Despite its age and 
small size relative to modern data sets, TIMIT's wide range of 
phonetically representative inputs, its time-aligned lexical and phonemic 
transcripts, and its easy availability through the LDC Catalog have contributed 
to its widespread use and continued popularity. Thousands of researchers 
remember its famous first sentence: "she had your dark suit in greasy wash 
water all year".

LDC continues the TIMIT series with its Global TIMIT project, which aims to 
create a series of corpora in a variety of languages with TIMIT-like features 
(Chanchaochai et al., 2018). Data sets published from that project include: 
Global TIMIT Learner Treebank English, Global TIMIT Learner Simple English, 
Global TIMIT Mandarin Chinese - Guanzhong Dialect, and Global TIMIT Mandarin 
Chinese.

The LDC Catalog features over 900 holdings in more than 90 languages, and more 
data is added each year. All TIMIT corpora are available for licensing by 
Consortium members and non-members. Visit Obtaining 
Data<https://www.ldc.upenn.edu/language-resources/data/obtaining> for more 
information.
________________________________
New publication:
Second DIHARD Challenge Evaluation - Eleven 
Sources<https://catalog.ldc.upenn.edu/LDC2022S06> was developed by LDC and 
contains approximately 20 hours of English and Chinese speech data along with 
corresponding annotations used in support of the Second DIHARD 
Challenge<https://dihardchallenge.github.io/dihard2>.

The Second DIHARD development and evaluation sets were drawn from diverse 
sources including monologues, map task dialogues, broadcast interviews, 
sociolinguistic interviews, meeting speech, speech in restaurants, clinical 
recordings, extended child language acquisition recordings, and web videos. 
Annotations include diarization and segmentation.

Second DIHARD Challenge Evaluation - Eleven Sources is distributed via web 
download.

2022 Subscription Members will automatically receive copies of this corpus. 
2022 Standard Members may request a copy as part of their 16 free membership 
corpora. Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options; or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
