[Corpora-List] November 2023 Newsletter - LDC

Penn LDC via Corpora Wed, 15 Nov 2023 09:47:18 -0800

In this newsletter:
Join LDC for Membership Year 2024
Spring 2024 data scholarship application deadline


New publications:
REMIX Telephone Collection<https://catalog.ldc.upenn.edu/LDC2023S09>
News Sub-domain Named Entity Recognition<https://catalog.ldc.upenn.edu/LDC2023T12>
________________________________
Join LDC for Membership Year 2024
It's time to renew your LDC membership for 2024. Current (2023) members who 
renew their membership before March 1, 2024 will receive a 10% discount. New or 
returning organizations will receive a 5% discount if they join the Consortium 
by March 1.

In addition to receiving new publications, current LDC members enjoy the 
benefit of licensing older data from our Catalog of 940+ holdings at reduced 
fees. Current-year for-profit members may use most data for commercial 
applications.

Plans for 2024 publications are in progress. Among the expected releases are:

  *   KASET: 147 hours of Sorani Kurdish and Kurmanji Kurdish conversational 
telephone speech and web broadcasts, 65 hours transcribed
  *   AIDA Topic Source Data and Annotations: multimodal source data and 
annotations in multiple languages (Russian, Ukrainian, English, Spanish) for 
information and entity extraction
  *   RATS Low Speech Density Data: 87 hours of Levantine Arabic, English, 
Persian, Pushto, and Urdu audio files selected from RATS speech activity 
detection and keyword spotting data sets, also including communications systems 
sounds and silence
  *   Call My Net 1: 364 hours of conversational telephone speech recordings in 
Tagalog, Cebuano, Cantonese and Mandarin from speakers in the Philippines and 
China using various handsets under diverse noise conditions
  *   Ravnursson Faroese Speech and Transcripts: 109 hours of read speech from 
433 native speakers with transcripts
  *   Diaspora Tibetan Speech: elicited, read, and spontaneous speech from 73 
native Tibetan speakers in Kathmandu's diaspora Tibetan community, some 
recordings transcribed
  *   IARPA MATERIAL language packs: conversational telephone speech, 
transcripts, English translations, annotations, and queries in multiple 
languages (e.g., Bulgarian, Somali, Georgian)
  *   LORELEI: representative and incident language packs containing 
monolingual text, bi-text, translations, annotations, supplemental resources, 
and related tools in various languages (e.g., Farsi, Hungarian, Hindi, Amharic)

For full descriptions of all LDC data sets, browse our 
Catalog<https://catalog.ldc.upenn.edu/>. Visit Join 
LDC<https://www.ldc.upenn.edu/members/join-ldc> for details on membership, user 
accounts and payment.

Spring 2024 data scholarship application deadline
Applications are now being accepted through January 15, 2024, for the Spring 
2024 LDC data scholarship program, which provides university students with 
no-cost access to LDC data. Consult the LDC Data 
Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>
page for more information about program rules and submission requirements.
________________________________
New publications:

REMIX Telephone Collection<https://catalog.ldc.upenn.edu/LDC2023S09> was 
developed by LDC and contains 320 hours of English conversational telephone 
speech from 358 speakers who had completed all tasks in one of the previous LDC 
Mixer collections, specifically, Mixers 4-7. The data was collected in 2012; 
recordings in this corpus were used to support the NIST 2012 Speaker 
Recognition Evaluation. Speakers completed up to 12 calls, each lasting up to 10 
minutes, conversing on suggested topics. They were asked to make half of their 
calls in a "noisy" environment, e.g., on a speakerphone, on a busy street, in a 
noisy store or office, or in a room with loud background noise. Speaker metadata 
is included.

2023 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

News Sub-domain Named Entity 
Recognition<https://catalog.ldc.upenn.edu/LDC2023T12> was developed at the 
University of Pennsylvania and contains over 20,000 English news sentences 
annotated with named entities and categorized into sub-domains. The sentences 
were extracted from The New York Times Annotated Corpus 
(LDC2008T19)<https://catalog.ldc.upenn.edu/LDC2008T19>. Named entity annotation 
was based on the CoNLL-2003 guidelines and annotation 
scheme<https://paperswithcode.com/dataset/conll-2003>. Sentences were labeled 
with person (PER), location (LOC) and organization (ORG) tags using phrase 
matching with a manual second pass. Sub-domains are: Arts (+Weekend/Cultural), 
Business (+Financial), Classifieds (+Obituary), Editorial, Foreign, 
Metropolitan, Sports, and Others. "Others" includes topics such as Real Estate, 
New Jersey Weekly, Book Review, Job Market, Science, and Health & Fitness.

2023 members can access this corpus through their LDC accounts provided they 
have submitted a signed copy of the special license agreement. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
