Fwd: FW: April 2017 Newsletter -- LDC

lewis john mcgibbney Mon, 17 Apr 2017 13:13:23 -0700

Hi Folks,
FYI
Lewis

---------- Forwarded message ----------
From: Mcgibbney, Lewis J (398M) <lewis.j.mcgibb...@jpl.nasa.gov>
Date: Mon, Apr 17, 2017 at 10:46 AM
Subject: FW: April 2017 Newsletter -- LDC
To: lewis john mcgibbney <lewi...@apache.org>







Dr. Lewis John McGibbney Ph.D., B.Sc.

Data Scientist II

Computer Science for Data Intensive Applications Group 398M

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibb...@jpl.nasa.gov







 Dare Mighty Things



*From: *Ldc-customers1 <ldc-customers1-boun...@ldc.upenn.edu> on behalf of
Penn LDC <l...@ldc.upenn.edu>
*Date: *Monday, April 17, 2017 at 8:05 AM
*To: *Penn LDC <l...@ldc.upenn.edu>
*Subject: *April 2017 Newsletter -- LDC



*In this newsletter*

*LDC celebrates 25 years *

*LDC data and commercial technology development *

*New publications:*

2010 NIST Speaker Recognition Evaluation Test Set
<https://catalog.ldc.upenn.edu/LDC2017S06>

BOLT Egyptian Arabic SMS/Chat and Transliteration
<https://catalog.ldc.upenn.edu/LDC2017T07>

CHiME2 Grid <https://catalog.ldc.upenn.edu/LDC2017S07>

____________________________________________________________

*LDC celebrates 25 years*

April 2017 marks the beginning of LDC’s 25th year as the leader in language
resource development and distribution. Founded in 1992, the Consortium has
grown from a data repository to a vibrant data center that creates, shares
and archives language resources. The Catalog continues to grow, boasting
over 700 titles in more than 90 languages. With the support of members,
licensees, sponsors and collaborators, LDC has distributed over 120,000
copies of data to more than 3,500 organizations worldwide. Our heartfelt
thanks for your support as we continue our mission to provide large
quantities of diverse data, research program support and high quality
member services.



*LDC data and commercial technology development *

Any organization wishing to use LDC data to develop or test products for
commercialization, or to use LDC data in any commercial product or for any
commercial purpose, must first license the data as a For-Profit Member.
Once the data is licensed under the For-Profit Membership, the organization
retains perpetual rights to use the data for commercial technology
development. LDC data users should consult corpus-specific license
agreements for limitations on the use of certain corpora. Visit our
Licensing <https://www.ldc.upenn.edu/data-management/using/licensing> page
for more information.



*New Corpora*

(1) 2010 NIST Speaker Recognition Evaluation Test Set
<https://catalog.ldc.upenn.edu/LDC2017S06> was developed by LDC and NIST
(National Institute of Standards and Technology). It contains 2,255 hours
of American English telephone speech and interview speech recorded over a
microphone channel used as test data in the NIST-sponsored 2010 Speaker
Recognition Evaluation (SRE)
<http://www.itl.nist.gov/iad/mig/tests/spk/2010/index.html>.

The telephone speech segments include two-channel excerpts of approximately
10 seconds and 5 minutes. There are also summed-channel excerpts in the
range of 5 minutes. The microphone excerpts are 3-15 minutes in duration.
As in prior evaluations, intervals of silence were not removed.

The 2010 evaluation includes not only conversational telephone speech (CTS)
recorded over ordinary telephone channels for the core training and test
conditions, but also CTS and conversational interview speech recorded over
a room microphone channel. Unlike prior evaluations, some of the
conversational telephone style speech was collected in a manner to produce
particularly high, or particularly low, vocal effort on the part of the
speaker of interest. In addition to evaluation data, this package also
includes answer keys, trial and train files, development data and
evaluation documentation.

2010 NIST Speaker Recognition Evaluation Test Set is distributed via hard
drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for US $4000.

*

(2) BOLT Egyptian Arabic SMS/Chat and Transliteration
<https://catalog.ldc.upenn.edu/LDC2017T07> was developed by LDC and
consists of naturally-occurring Short Message Service (SMS) and Chat (CHT)
data collected through data donations and live collection involving native
speakers of Egyptian Arabic. The corpus contains 5,691 conversations
totaling 1,029,248 words across 262,026 messages. Messages were natively
written in either Arabic orthography or romanized Arabizi. A total of 1,856
Arabizi conversations (287,022 words) were transliterated from the original
romanized Arabizi script into standard Arabic orthography and then
reviewed, corrected and normalized by LDC annotators according to
"Conventional Orthography for Dialectal Arabic" (CODA).



The BOLT <https://www.ldc.upenn.edu/collaborations/current-projects/bolt>
(Broad Operational Language Translation) program developed machine translation and
information retrieval for less formal genres, focusing particularly on
user-generated content. LDC supported the BOLT program by collecting
informal data sources -- discussion forums, text messaging and chat -- in
Chinese, Egyptian Arabic and English. The collected data was translated and
annotated for various tasks including word alignment, treebanking,
propbanking and co-reference.



BOLT Egyptian Arabic SMS/Chat and Transliteration is distributed via web
download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for US $1750.

*

(3) CHiME2 Grid <https://catalog.ldc.upenn.edu/LDC2017S07> was developed as
part of The 2nd CHiME Speech Separation and Recognition Challenge
<http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/> and contains
approximately 120 hours of English speech from a noisy living room
environment. The CHiME Challenges focus on distant-microphone automatic
speech recognition (ASR) in real-world environments.

CHiME2 Grid reflects the small vocabulary track
<http://spandh.dcs.shef.ac.uk/chime_challenge/chime2013/chime2_task1.html> of
the CHiME2 Challenge. The target utterances were taken from the Grid corpus
<http://spandh.dcs.shef.ac.uk/gridcorpus/> and consist of 34 speakers
reading simple 6-word sequences. The data is divided into training,
development and test sets.

CHiME2 Grid is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for US $50.



Membership Office

Linguistic Data Consortium <http://ldc.upenn.edu>

University of Pennsylvania

T: +1-215-573-1275

E: l...@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104







-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
