Fwd: FW: November 2018 Newsletter - LDC

lewis john mcgibbney Thu, 15 Nov 2018 10:39:02 -0800

---------- Forwarded message ---------
From: Mcgibbney, Lewis J (398M) <lewis.j.mcgibb...@jpl.nasa.gov>
Date: Thu, Nov 15, 2018 at 8:56 AM
Subject: FW: November 2018 Newsletter - LDC
To: lewis john mcgibbney <lewi...@apache.org>







Dr. Lewis John McGibbney Ph.D., B.Sc.

Data Scientist II

Computer Science for Data Intensive Applications Group (398M)

Instrument Software and Science Data Systems Section (398)

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibb...@jpl.nasa.gov

ORCID: orcid.org/0000-0003-2185-928X






 Dare Mighty Things



*From: *Ldc-customers1 <ldc-customers1-boun...@ldc.upenn.edu> on behalf of
Penn LDC <l...@ldc.upenn.edu>
*Date: *Thursday, November 15, 2018 at 8:54 AM
*To: *"'ldc-custome...@ldc.upenn.edu'" <ldc-custome...@ldc.upenn.edu>
*Subject: *November 2018 Newsletter - LDC



*In this newsletter:*



*Join LDC for Membership Year 2019*



*Spring 2019 Data Scholarship Program*



*Commercial use and LDC data*



*New publications:*

AISHELL-1 <https://catalog.ldc.upenn.edu/LDC2018S14>

Avatar Education Portuguese <https://catalog.ldc.upenn.edu/LDC2018S15>

BOLT Egyptian Arabic Treebank - Discussion Forum
<https://catalog.ldc.upenn.edu/LDC2018T23>

IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a
<https://catalog.ldc.upenn.edu/LDC2018S16>

*_________________________________________________________________________________*



*Join LDC for Membership Year 2019*

Membership Year 2019 (MY2019) is open, and discounts are available for those
who keep their membership current and join early in the year. Current MY2018
members who renew their LDC membership before March 1, 2019 will receive a
10% discount on the membership fee. New or returning organizations will
receive a 5% discount through March 1.



In addition to receiving new publications, current LDC members also enjoy
the benefit of licensing older data at reduced costs from our Catalog of
over 750 holdings. Current-year for-profit members may use most data for
commercial applications.



Plans for MY2019 publications are in progress. Among the expected releases
are:

·         *SRI Speech-Based Collaborative Learning Corpus*: speech from
over 100 US middle school students performing collaborative learning tasks,
includes audio recordings, orthographic transcriptions, manual annotation
of collaboration, and related documentation

·         *Chinese Abstract Meaning Representation (AMR)*: developed by
Nanjing Normal University and Brandeis University, semantic representation
of approximately 10,000 Chinese sentences following the basic principles of
AMR using web source data from Chinese Treebank 8.0 (LDC2013T21)

·         *Multilanguage conversational telephone speech:* developed to
support language identification research in related languages (Arabic, East
Asian, English, Mandarin)

·         *TAC KBP:* English entity discovery and linking, nugget detection
and event argument data, Chinese slot-filling data

·         *CALLFRIEND Second Edition:* updated releases with .wav format
audio, simplified directory structure and enhanced documentation and
metadata (English, Egyptian Arabic, Mandarin Chinese-Taiwan)

·         *HAVIC Med Progress Test data:* English web video, metadata, and
annotations for developing multimedia systems

·         *IARPA Babel Language Packs (telephone speech and transcripts):*
languages include Amharic, Guarani, Igbo, and Lithuanian

·         *BOLT:* discussion forums, SMS, word-aligned and tagged data in
all languages (Chinese, Egyptian Arabic, English)



It’s also not too late to join for MY2017 (through December 31, 2018) and
MY2018 (through December 31, 2019). Data sets from those years include 2010
NIST Speaker Recognition Evaluation Test Set, RATS Keyword Spotting and
Language Identification releases, CHiME, Noisy TIMIT Speech, Concretely
Annotated New York Times and English Gigaword, DIRHA English WSJ Audio,
LORELEI Amharic and Somali Language Packs and DEFT Spanish Treebank. For
full descriptions of all LDC data sets, browse our Catalog
<https://catalog.ldc.upenn.edu/>.



Visit Join LDC <https://www.ldc.upenn.edu/members/join-ldc> for details on
membership, user accounts and payment.



*Spring 2019 Data Scholarship Program*

Applications are being accepted through January 15, 2019 for the Spring
2019 LDC Data Scholarship program, which provides university students with
no-cost access to LDC data. Consult the LDC Data Scholarship
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships> page
for more information about program rules and submission requirements.



*Commercial use and LDC data*

For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC
databases. Non-member organizations, including non-member for-profit
organizations, cannot use LDC data to develop or test products for
commercialization, nor can they use LDC data in any commercial product or
for any commercial purpose. LDC data users should consult corpus-specific
license agreements for limitations on the use of certain corpora. Visit the
Licensing <https://www.ldc.upenn.edu/data-management/using/licensing> page
for further information.





*New publications:*



(1) AISHELL-1 <https://catalog.ldc.upenn.edu/LDC2018S14> was developed by
Beijing Shell Shell Technology Co., Ltd. <http://www.aishelltech.com/sy> It
contains approximately 520 hours of Chinese Mandarin speech from 400
speakers recorded simultaneously on three different devices with associated
transcripts.



The goal of the collection was to support speech recognition system
development in 11 domains, including smart homes, autonomous driving,
entertainment, finance, and science and technology. Participants read 500
sentences covering the domains; sentences were chosen for their speech and
phonetic characteristics. The speech was recorded in a quiet indoor
environment on a high fidelity microphone and two mobile phones (Android
and iOS).



Speakers were recruited from different accent areas across China, including
North, South, and Yue-Gui-Min regions. There were 214 female speakers and
186 male speakers. Additional demographic information about the
participants is included in this release.



AISHELL-1 is distributed via hard drive.



2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $3,000.

*



(2) Avatar Education Portuguese <https://catalog.ldc.upenn.edu/LDC2018S15> was
developed by the University of Pernambuco <http://www.upe.br/> and consists
of approximately 80 minutes of Brazilian Portuguese microphone speech with
phonetic and orthographic transcriptions. The data was developed for Avatar
Education, an animated virtual assistant designed to enhance communication
and interaction in educational contexts, such as online learning.



The corpus contains 1,400 speakers (700 male, 700 female) who generated
1,400 utterances from read and spontaneous speech. Utterances were
transcribed at the word level (without time alignments) and at the phoneme
level (with time alignment labels).



Avatar Education Portuguese is distributed via web download.



2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $100.

*



(3) BOLT Egyptian Arabic Treebank - Discussion Forum
<https://catalog.ldc.upenn.edu/LDC2018T23> was developed by LDC and
consists of Egyptian Arabic web discussion forum data with part-of-speech
annotation, morphology, gloss and syntactic tree annotation collected for
the DARPA Broad Operational Language Translation (BOLT) Program.



The annotations in this release follow Penn Arabic Treebank (PATB)
annotation guidelines. There are two kinds of morphological analysis
synchronized in the corpus. LDC Standard Morphological Analyzer (SAMA)
Version 3.1 (LDC2010L01 <https://catalog.ldc.upenn.edu/LDC2010L01>) was
used for Modern Standard Arabic tokens, and CALIMA (Columbia Arabic
Language and dIalect Morphological Analyzer) was used for Egyptian-Arabic
tokens.



This release contains 440,448 tokens before clitics were split and 508,548
tree tokens after clitics were split for treebank annotation. The source
material is web discussion forums collected by LDC from various sources.



The unannotated Egyptian Arabic source data is released as BOLT Arabic
Discussion Forums (LDC2018T10 <https://catalog.ldc.upenn.edu/LDC2018T10>).



BOLT Egyptian Arabic Treebank - Discussion Forum is distributed via web
download.



2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $4,500.

*



(4) IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a
<https://catalog.ldc.upenn.edu/LDC2018S16> was developed by Appen
<http://www.appen.com/> for the IARPA (Intelligence Advanced Research
Projects Activity) Babel
<http://www.iarpa.gov/index.php/research-programs/babel> program. It
contains approximately 201 hours of Telugu conversational and scripted
telephone speech collected in 2013 and 2014 along with corresponding
transcripts.



The Telugu speech in this release represents that spoken in the Central,
East, South, and North Telugu dialect regions of India. The gender
distribution among speakers is approximately equal; speakers' ages range
from 16 years to 65 years. Calls were made using different telephones
(e.g., mobile, landline) from a variety of environments including the
street, a home or office, a public place, and inside a vehicle.



IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a is available via web
download.



2018 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2018
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for $25.



*



Membership Office

Linguistic Data Consortium <http://ldc.upenn.edu>

University of Pennsylvania

T: +1-215-573-1275

E: l...@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104












-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
