[Corpora-List] November 2022 Newsletter - LDC

Penn LDC via Corpora Tue, 15 Nov 2022 09:25:25 -0800

In this newsletter:
Join LDC for membership year 2023
Fall 2022 data scholarship recipients
Spring 2023 data scholarship application deadline
30th Anniversary Highlight: CALLFRIEND


New publications:
BOLT English Translation Treebank - Egyptian Arabic SMS/Chat<https://catalog.ldc.upenn.edu/LDC2022T06>
Samrómur Children Icelandic Speech 1.0<https://catalog.ldc.upenn.edu/LDC2022S11>
Third DIHARD Challenge Development<https://catalog.ldc.upenn.edu/LDC2022S12>
________________________________
Join LDC for membership year 2023
It's time to renew your LDC membership for 2023. Current (2022) members who 
renew their membership before March 1, 2023 will receive a 10% discount. New or 
returning organizations will receive a 5% discount if they join the Consortium 
by March 1.

In addition to receiving new publications, current LDC members enjoy the 
benefit of licensing older data from our Catalog of 900+ holdings at reduced 
fees. Current-year for-profit members may use most data for commercial 
applications.

Plans for 2023 publications are in progress. Among the expected releases are:

  *   AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts: 156 
hours of Ukrainian conversational telephone speech and broadcast news with 1.2 
million words of corresponding orthographic transcripts
  *   2019 NIST SRE: audiovisual and leaderboard challenge sets based on 
amateur videos and Tunisian Arabic telephone speech, respectively
  *   DEFT English ERE: English text from assorted genres annotated for 
entities, relations and events
  *   Mixer 3 and Mixer 7 speech collections: thousands of hours of telephone 
speech and metadata from Mixer 3 (multiple languages) and Mixer 7 (Spanish, 
plus interviews and transcript readings)
  *   CALLFRIEND Russian: 100 telephone conversations among native speakers, 
transcripts and a lexicon, released in separate speech and text data sets
  *   REMIX Telephone Collection: English telephone speech from 385 
participants in previous Mixer studies
  *   LORELEI: representative and incident language packs containing 
monolingual text, bi-text, translations, annotations, supplemental resources 
and related tools in various languages (e.g., Indonesian, Swahili, Tagalog, 
Tamil, Zulu)
For full descriptions of all LDC data sets, browse our 
Catalog<https://catalog.ldc.upenn.edu/>.  Visit Join 
LDC<https://www.ldc.upenn.edu/members/join-ldc> for details on membership, user 
accounts and payment.
Fall 2022 LDC data scholarship recipients
LDC congratulates the following Fall 2022 data scholarship recipients:

  *   Nelson Filipe Costa: Concordia University (Canada); PhD, Machine 
Learning. Nelson is awarded a copy of Penn Discourse Treebank Version 3.0 
(LDC2019T05<https://catalog.ldc.upenn.edu/LDC2019T05>) for his work in 
discourse relationships and mapping.
  *   Paul Pope: University of Eastern Finland (Finland); MA, Linguistic Data 
Sciences. Paul is awarded a copy of ETS Corpus of Non-Native Written English 
(LDC2014T06<https://catalog.ldc.upenn.edu/LDC2014T06>) for his research on text 
classification.
  *   Abhinav Singh: Sharda University (India); PhD, Forensic Science. Abhinav 
is awarded a copy of TIMIT Acoustic-Phonetic Continuous Speech Corpus 
(LDC93S1<https://catalog.ldc.upenn.edu/LDC93S1>) for his research on forensic 
speech recognition.
  *   Lucas Zheng: Deerfield Academy (USA); High School Scholar. Lucas is 
awarded copies of Arabic Treebank Part 1 v. 4.1 
(LDC2010T13<https://catalog.ldc.upenn.edu/LDC2010T13>) and Arabic Treebank Part 
2 v. 3.1 (LDC2011T09<https://catalog.ldc.upenn.edu/LDC2011T09>)  for his work 
on analyzing syntactic and lexical similarities across MSA genres and 
POS-tagging for MSA.
Students can learn more about the LDC data scholarship program on the Data 
Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> 
page.

Spring 2023 data scholarship application deadline
Applications are now being accepted through January 15, 2023 for the Spring 
2023 LDC data scholarship program, which provides university students with 
no-cost access to LDC data. Consult the LDC Data 
Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> 
page for more information about program rules and submission requirements.
30th Anniversary Highlight: CALLFRIEND
The CALLFRIEND series is a multi-language collection of unscripted telephone 
conversations conducted by LDC in the 1990s to support language identification 
technology development (Liberman & Cieri, 
1998<https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec1998-creation-distribution-use.pdf>).
Covered languages are American English, Canadian French, Egyptian Arabic, 
Farsi, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and 
Vietnamese. For English, Mandarin and Spanish, the collection includes two 
distinct dialects. Participants could speak with a person of their choice on 
any topic; most called family members and friends. All calls originated in 
North America.

This speech data was the foundation for NIST's Language Recognition 
Evaluations<https://www.nist.gov/itl/iad/mig/language-recognition> conducted 
from 1996 to 2007. The first editions of the CALLFRIEND series, published in 
LDC's Catalog in 1996, contain 60 calls evenly split into 20 calls each for a 
training partition to develop language models, a development partition for 
parameter tuning, and an evaluation partition to test performance 
(Torres-Carrasquillo et al., 
2004<https://www.isca-speech.org/archive_open/archive_papers/odyssey_04/ody4_297.pdf>).

Beginning in 2014, LDC released second editions for American English 
(LDC2019S21<https://catalog.ldc.upenn.edu/LDC2019S21>, 
LDC2020S08<https://catalog.ldc.upenn.edu/LDC2020S08>), Canadian French 
(LDC2019S18<https://catalog.ldc.upenn.edu/LDC2019S18>), Egyptian Arabic 
(LDC2019S04<https://catalog.ldc.upenn.edu/LDC2019S04>), Farsi 
(LDC2014S01<https://catalog.ldc.upenn.edu/LDC2014S01>), and Mandarin Chinese 
(LDC2018S09<https://catalog.ldc.upenn.edu/LDC2018S09>, 
LDC2020S06<https://catalog.ldc.upenn.edu/LDC2020S06>).  The goal of the second 
editions is to facilitate continued widespread use of the data, specifically, 
by updating the audio files to .wav format, simplifying the directory 
structure, adding documentation and metadata, and combining the training, 
development and evaluation splits. CALLFRIEND Farsi Second Edition also 
includes additional telephone recordings and a separate transcripts release 
(LDC2014T01<https://catalog.ldc.upenn.edu/LDC2014T01>).

In addition to work on language identification, CALLFRIEND corpora have been 
used in a variety of research tasks, including subject omission in Korean (Lee 
2012<https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE10893918>), 
contemporary Persian vowels in casual speech (Jones 
2019<https://repository.upenn.edu/pwpl/vol25/iss1/15/>), Mandarin telephone 
closings among familiars (Huang, 
2020<https://www.jbe-platform.com/content/journals/10.1075/ap.19017.hua>), and 
adjective constructions in English conversation (Bybee & Thompson, 
2021<https://www.mdpi.com/2226-471X/7/1/2/htm>), among many others.

To learn more about the CALLFRIEND collection or about other LDC corpora used 
for language identification research, search the 
Catalog<https://catalog.ldc.upenn.edu/search> by "recommended application" 
and select "language identification" from the list.
________________________________
New publications:
BOLT English Translation Treebank - Egyptian Arabic 
SMS/Chat<https://catalog.ldc.upenn.edu/LDC2022T06> was developed by LDC and 
consists of SMS and chat text data (472 files representing 98,206 tokens) 
translated from Egyptian Arabic to English and annotated for part-of-speech and 
syntactic structure. Only the translated English text is included in the source 
data for this release. Part-of-speech and treebank annotation conformed to Penn 
Treebank II style, incorporating changes to those guidelines that were 
developed under the GALE (Global Autonomous Language Exploitation) program. 
Supplementary guidelines for English treebanks and web text are included in the 
corpus documentation.

2022 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.
*
Samrómur Children Icelandic Speech 
1.0<https://catalog.ldc.upenn.edu/LDC2022S11> was developed by the Language and 
Voice Lab, Reykjavik University<https://lvl.ru.is/> in cooperation with 
Almannarómur, Center for Language Technology<https://almannaromur.is/>. The 
corpus contains 131 hours of Icelandic prompted speech from 3,175 speakers 
(children, aged 4-17 years) representing 137,597 utterances.

Speech data was collected between October 2019 and September 2021 using the 
Samrómur website<https://samromur.is> which displayed prompts to participants. 
The prompts were mainly from The Icelandic Gigaword 
Corpus<http://clarin.is/en/resources/gigaword>, which includes text from 
novels, news, plays, and from a list of location names in Iceland. Additional 
prompts were taken from the Icelandic Web of 
Science<https://www.visindavefur.is/> and others were created by combining a 
name followed by a question or a demand. Prompts and speaker metadata are 
included in the corpus.

2022 members can access this corpus through their LDC accounts provided they 
have submitted a completed copy of the special license agreement. Non-members 
may license this data for a fee.
*
Third DIHARD Challenge Development<https://catalog.ldc.upenn.edu/LDC2022S12> 
was developed by LDC and contains approximately 34 hours of English and Chinese 
speech data along with corresponding annotations used in support of the Third 
DIHARD Challenge<https://dihardchallenge.github.io/dihard3>.

The Third DIHARD development and evaluation sets were drawn from diverse 
sources including monologues, map task dialogues, broadcast interviews, 
sociolinguistic interviews, meeting speech, speech in restaurants, clinical 
recordings, and amateur web videos. Annotations include diarization and 
segmentation.

2022 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options; or contact LDC for assistance.

