[Corpora-List] September 2022 Newsletter - LDC

Penn LDC via Corpora Thu, 15 Sep 2022 09:15:39 -0700

In this newsletter:
Upcoming Policy Change to LDC's Open Memberships
LDC at Interspeech 2022
LanguageARC: Citizen Science for Language
30th Anniversary Highlight: Switchboard


New publications:
Xi'an Guanzhong Object Naming<https://catalog.ldc.upenn.edu/LDC2022S09>
MASRI Synthetic<https://catalog.ldc.upenn.edu/LDC2022S08>
________________________________
Upcoming Policy Change to LDC's Open Memberships
LDC is changing its open membership year policy beginning January 1, 2023.
Only one membership year will be open for joining - the current membership 
year. The 2022 membership year will close for joining on December 31, 2022. We 
expect this change to have a minimal impact on members, while allowing us to 
streamline our processes to serve members better. LDC's many membership 
benefits<https://www.ldc.upenn.edu/members/benefits> will remain the same and 
organizations choosing to join membership years in advance will still be able 
to do so. If you have any questions about this change, please don't hesitate to 
contact our membership office<mailto:[email protected]>.

LDC at Interspeech 2022
LDC is proud to sponsor the Workshop for Young Female Researchers in 
Speech<https://sites.google.com/view/yfrsw-2022/> (YFRSW) to be held in person
as an Interspeech 2022<https://interspeech2022.org/> pre-conference satellite 
event on September 17. Also, be sure to check out the collaborative work of 
LDC's Mark Liberman, "The mapping between syntactic and prosodic phrasing in 
English and Mandarin", presented during the On-Site Oral Session: Phonetics and 
Phonology on Wednesday, September 21, 13:30-15:30 KST.

LanguageARC: Citizen Science for Language
LanguageARC<https://languagearc.com> is a citizen science web portal for 
language research developed by LDC with the support of the National Science 
Foundation (grant #1730377).

LanguageARC brings together researchers and participants from the general 
public interested in language to form a community dedicated to supporting and
advancing language-related research and development. Contributors to this online
community can participate in a variety of language-related tasks and activities 
such as reading text, answering questions, describing images or video, creating 
or evaluating transcriptions for audio clips, or developing translations into 
their native languages. LanguageARC includes projects in languages other than 
English, such as French, Sesotho, and Swedish. Xi'an Guanzhong Object Naming 
LDC2022S09<https://catalog.ldc.upenn.edu/LDC2022S09>, released this month in 
LDC's Catalog and described below, is an example of a data set developed using 
LanguageARC. New projects will be added on an ongoing basis.

Sign up for a LanguageARC account today to start making real contributions to 
language knowledge and research. Please share this information with colleagues, 
students, and anyone who might be interested in participating in the language 
activities on this website. If you are a researcher interested in creating a 
project on LanguageARC, please reach out on the site's
Contact<https://languagearc.com/messages/new> page.

Find LanguageARC on Facebook at: https://www.facebook.com/languagearc

30th Anniversary Highlight: Switchboard
Switchboard-1 Release 2 (LDC97S62<https://catalog.ldc.upenn.edu/LDC97S62>) is
considered the first large collection of spontaneous conversational telephone
speech (Graff & Bird,
2000<https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2000-many-uses-many-annotations.pdf>).
It consists of approximately 260 hours of recordings collected by Texas
Instruments in 1990-1991 (Godfrey et al.,
1992<https://isip.piconepress.com/projects/switchboard/doc/education/papers/paper_1.pdf>).
The first release of the corpus (later superseded) was published by NIST and
distributed by LDC in 1993.

Participants were 543 speakers (302 male, 241 female) from across the United 
States who accounted for around 2,400 two-sided telephone conversations. A 
robot operator handled the calls, giving the caller appropriate recorded 
prompts, selecting and dialing another person (the callee) to take part in a 
conversation, introducing a topic for discussion, and recording the speech from 
the two subjects into separate channels until the conversation was finished. 
Roughly 70 topics were provided, of which about 50 were used frequently. 
Selection of topics and callees was constrained so that: (1) no two speakers 
would converse together more than once and (2) no one spoke more than once on a 
given topic.
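
The two selection constraints above can be illustrated with a minimal sketch. This is not the robot operator's actual logic, which the newsletter does not describe; the function names, data structures, and the first-match strategy are all assumptions made for illustration.

```python
# Hypothetical sketch of the two Switchboard selection constraints.
# All names here are illustrative; the real system's logic is not documented above.

def pick_callee_and_topic(caller, available, topics, past_pairs, past_topics):
    """Return the first (callee, topic) that satisfies both constraints, or None."""
    for callee in available:
        if callee == caller:
            continue
        # Constraint 1: no two speakers converse together more than once.
        if frozenset((caller, callee)) in past_pairs:
            continue
        for topic in topics:
            # Constraint 2: no one speaks more than once on a given topic.
            if topic in past_topics.get(caller, set()):
                continue
            if topic in past_topics.get(callee, set()):
                continue
            return callee, topic
    return None


def record_call(caller, callee, topic, past_pairs, past_topics):
    """Remember a completed conversation so later selections respect it."""
    past_pairs.add(frozenset((caller, callee)))
    past_topics.setdefault(caller, set()).add(topic)
    past_topics.setdefault(callee, set()).add(topic)
```

Each completed call would be passed to record_call so that subsequent selections exclude the same speaker pair and the topics each speaker has already discussed.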

This gold standard data set has been used for many HLT applications, including
speaker identification, speaker authentication, and speech recognition. It is
considered one of the most important benchmarks for recognition tasks involving
large vocabulary conversational speech (Deshmukh et al.,
1998<https://www.researchgate.net/profile/Aravind-Ganapathiraju/publication/221489932_Resegmentation_of_SWITCHBOARD/links/5610c41c08ae0fc513f155bb/Resegmentation-of-SWITCHBOARD.pdf>)
as well as a key resource for studying the phonetic properties of spontaneous
speech (Greenberg et al.,
1996<https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.3655&rep=rep1&type=pdf>).
Annotation tasks based on Switchboard include discourse tags/speech acts,
part-of-speech tagging and parsing, and sentiment
analysis<https://catalog.ldc.upenn.edu/LDC2020T14>.

The Switchboard series includes Switchboard Credit 
Card<https://catalog.ldc.upenn.edu/LDC93S8>, Phase 
II<https://catalog.ldc.upenn.edu/LDC98S75>, Phase 
III<https://catalog.ldc.upenn.edu/LDC2002S06>, the Switchboard 
Cellular<https://catalog.ldc.upenn.edu/LDC2001S13> collection, and new 
recordings from 18 Switchboard participants in the 2013 
Greybeard<https://catalog.ldc.upenn.edu/LDC2013S05> corpus.

All Switchboard corpora are available in the Catalog for licensing by 
Consortium members and non-members. Visit Obtaining 
Data<https://www.ldc.upenn.edu/language-resources/data/obtaining> for more 
information.
________________________________
New publications:
Xi'an Guanzhong Object Naming<https://catalog.ldc.upenn.edu/LDC2022S09>
contains 15 hours of audio recordings from speakers of the Guanzhong
dialect of Mandarin Chinese living in or near Xi'an in Shaanxi Province
(China) naming objects that appeared in colored line drawings. The corpus was
developed to support traditional and computer-aided language documentation.

The data was collected from a closed volunteer community between February and
May 2021 using LanguageARC<https://languagearc.com/>, a citizen science portal
developed by LDC. Speakers were presented with images
selected from the MultiPic dataset<https://www.bcbl.eu/databases/multipic> and
were asked to record themselves naming the objects in the images.

Xi'an Guanzhong Object Naming is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 
2022 Standard Members may request a copy as part of their 16 free membership 
corpora. Non-members may license this data for a fee.
*
MASRI (Maltese Automatic Speech Recognition I)
Synthetic<https://catalog.ldc.upenn.edu/LDC2022S08> was developed by the MASRI
team<https://www.um.edu.mt/projects/masri/> at the University of
Malta<https://www.um.edu.mt/> and contains 99 hours of synthesized Maltese
speech.

Source sentences were extracted from the Maltese Language Resource 
Server<https://mlrs.research.um.edu.mt/index.php?page=corpora> (MLRS) corpus, 
which consists of written or transcribed Maltese covering various genres, including
parliamentary debates, news, law, opinion, sports, culture, academic, 
literature, and religious texts. Text was processed through the CrimsonWing 
text-to-speech system to generate speech files. Synthesized speech was created 
with 210 voices (105 female, 105 male).

MASRI Synthetic is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus 
provided they have submitted a completed copy of the special license agreement. 
2022 Standard Members may request a copy as part of their 16 free membership 
corpora. Non-members may license this data for a fee.


Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
