In this newsletter:
LDC data and commercial technology development

New publications:
2015 NIST Language Recognition Evaluation Test 
Set<https://catalog.ldc.upenn.edu/LDC2025S02>
The Xi'an Multi-Language Learner 
Corpus<https://catalog.ldc.upenn.edu/LDC2025T03>

________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite 
for obtaining a commercial license to almost all LDC databases. Non-member 
organizations, including non-member for-profit organizations, cannot use LDC 
data to develop or test products for commercialization, nor can they use LDC 
data in any commercial product or for any commercial purpose. LDC data users 
should consult corpus-specific license agreements for limitations on the use of 
certain corpora. Visit the 
Licensing<https://www.ldc.upenn.edu/data-management/using/licensing> page for 
further information.
________________________________

New publications:
2015 NIST Language Recognition Evaluation Test 
Set<https://catalog.ldc.upenn.edu/LDC2025S02> was developed by LDC and NIST. It 
contains the evaluation test set for the 2015 NIST Language Recognition 
Evaluation (LRE), approximately 867 hours of conversational telephone speech 
(CTS) and broadcast narrowband speech (BNBS) collected by LDC in 20 languages 
across 6 clusters of related languages: Arabic (Egyptian, Iraqi, Levantine, 
Maghrebi, Modern Standard Arabic); Spanish (Caribbean, European, Latin 
American, Brazilian Portuguese); English (British, Indian, General American 
English); Chinese (Cantonese, Mandarin, Min Nan, Wu); Slavic (Polish, Russian); 
and French (West African, Haitian Creole).

The CTS data includes calls between individuals in the same social networks 
lasting 8-15 minutes and telephone speech from the IARPA Babel series collected 
in 2012-2013 from speakers using a range of phone types in diverse settings 
with varying noise conditions. The BNBS data was collected by LDC from 
streaming and satellite radio programming, focusing on programs that included 
narrowband speech (e.g., call-ins to a talk show).

The goal of NIST's LRE evaluations is to establish the baseline of current 
performance capability for CTS language recognition and to lay the groundwork 
for further research efforts. LRE15 expanded the range of test segment 
durations and added a test condition that allowed systems to make use of 
unrestricted training data when developing models.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

The Xi'an Multi-Language Learner 
Corpus<https://catalog.ldc.upenn.edu/LDC2025T03> was developed by Xi'an 
International Studies University (XISU)<https://en.xisu.edu.cn/> and is 
composed of 526 argumentative essays in 15 languages by Chinese L1 university 
students studying second languages, along with student metadata and writing 
prompts. It was developed to support second language learner research and to 
provide a database for cross-linguistic comparison of second languages.

Data was collected in 2023 and 2024 from students at XISU and Yunnan Minzu 
University (YMU) who were linguistics majors or were studying one of the foreign 
languages available at XISU and YMU. Off-topic essays and incomplete texts were 
excluded.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104




