[Corpora-List] July 2022 Newsletter - LDC

Penn LDC via Corpora Fri, 15 Jul 2022 08:29:46 -0700

In this newsletter:
Fall 2022 LDC Data Scholarship Program
30th Anniversary Highlight: ATIS0 Complete


New publications:
Qatari Corpus of Argumentative Writing<https://catalog.ldc.upenn.edu/LDC2022T04>
Second DIHARD Challenge Evaluation - 
SEEDLingS<https://catalog.ldc.upenn.edu/LDC2022S07>
________________________________
Fall 2022 LDC Data Scholarship Program
Student applications for the Fall 2022 LDC Data Scholarship program are being 
accepted now through September 15, 2022. This program provides eligible 
students with no-cost access to LDC data. Students must complete an application 
consisting of a data use proposal and letter of support from their advisor. For 
application requirements and program rules, visit the LDC Data Scholarships 
page<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.

30th Anniversary Highlight: ATIS0 Complete
The ATIS corpora were among the first publications that appeared with the 
launch of LDC's catalog in 1993. ATIS0 Complete 
(LDC93S4A)<https://catalog.ldc.upenn.edu/LDC93S4A> comprises spontaneous 
speech, read speech, and other material from participants in the ATIS 
collection that is contained in ATIS0 Pilot 
(LDC93S4B)<http://catalog.ldc.upenn.edu/LDC93S4B>, ATIS0 Read 
(LDC93S4B-2)<http://catalog.ldc.upenn.edu/LDC93S4B-2>, and ATIS0 SD-Read 
(LDC93S4B-3)<http://catalog.ldc.upenn.edu/LDC93S4B-3>.

The ATIS (Air Travel Information Services) collection was developed to support 
the research and development of speech understanding systems. Participants were 
presented with various hypothetical travel planning scenarios and asked to 
solve them by interacting with partially or completely automated ATIS systems. 
The resulting utterances were recorded and transcribed. Data was collected in 
the early 1990s at five US sites: Raytheon BBN, Carnegie Mellon University, MIT 
Laboratory for Computer Science, National Institute of Standards and 
Technology, and SRI International.

The ATIS collection has been widely used to further research in spoken language 
understanding and slot filling (Kuo et al., 
2020<https://arxiv.org/pdf/2009.14386.pdf>). Other data sets published from the 
collection include ATIS2 (LDC93S5)<https://catalog.ldc.upenn.edu/LDC93S5>, 
ATIS3 Training and Test Data (LDC94S19<https://catalog.ldc.upenn.edu/LDC94S19>, 
LDC95S26<https://catalog.ldc.upenn.edu/LDC95S26>) and, more recently, 
Multilingual ATIS (LDC2019T04)<https://catalog.ldc.upenn.edu/LDC2019T04> and 
ATIS - Seven Languages (LDC2021T04)<https://catalog.ldc.upenn.edu/LDC2021T04>.

All ATIS corpora are available for licensing by Consortium members and 
non-members. Visit Obtaining 
Data<https://www.ldc.upenn.edu/language-resources/data/obtaining> for more 
information.
________________________________
New publications:

(1) Qatari Corpus of Argumentative 
Writing<https://catalog.ldc.upenn.edu/LDC2022T04> was developed by Qatar 
University<http://www.qu.edu.qa/>, University of 
Exeter<https://www.exeter.ac.uk/>, and Hamad Bin Khalifa 
University<https://www.hbku.edu.qa/en>, and comprises approximately 
200,000 tokens of Arabic and English writing by undergraduate students (159 
female, 36 male) along with annotations and related metadata. Students were 
native Arabic speakers and fluent in English; each student wrote one Arabic and 
one English essay in response to specific argumentative prompts. They were 
instructed to include in their essays a clear thesis statement supported by 
relevant evidence.

The corpus is divided into Arabic and English parts, each of which contains 195 
essays. Metadata includes information about the students (gender, major, first 
language, second language) and information about the essay texts (serial 
numbers of texts, word limits, genre, date of writing, time spent on writing, 
place of writing).

Qatari Corpus of Argumentative Writing is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus. 
2022 Standard Members may request a copy as part of their 16 free membership 
corpora. Non-members may license this data for a fee.
*

(2) Second DIHARD Challenge Evaluation - 
SEEDLingS<https://catalog.ldc.upenn.edu/LDC2022S07> was developed by Duke 
University and LDC and contains approximately two hours of English child 
language recordings along with corresponding annotations used in support of the 
Second DIHARD Challenge<https://dihardchallenge.github.io/dihard2>.

Source data is from the 
SEEDLingS<https://homebank.talkbank.org/access/Password/Bergelson.html> (The 
Study of Environmental Effects on Developing Linguistic Skills) corpus, 
designed to investigate how infants' early linguistic and environmental input 
plays a role in their learning. Recordings were generated in the home 
environment of infants in the Rochester, New York area. A subset of that data 
was annotated by LDC for use in the First and Second DIHARD Challenges.

Second DIHARD Challenge Evaluation - SEEDLingS is distributed via web download.

2022 Subscription Members will automatically receive copies of this corpus 
provided they have submitted a completed copy of the special license agreement. 
2022 Standard Members may request a copy as part of their 16 free membership 
corpora. Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options; or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

