In this newsletter:
LDC membership discounts expire March 1
30th Anniversary Highlight: Arabic Treebank

New publications:
2019 NIST Speaker Recognition Evaluation Test Set - 
Audio-Visual<https://catalog.ldc.upenn.edu/LDC2023V01>
LORELEI Tagalog Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2023T02>
________________________________
LDC membership discounts expire March 1
Time is running out to save on 2023 membership fees. Renew your LDC membership, 
rejoin the Consortium, or become a new member by March 1 to receive a discount 
of up to 10%. For more information on membership benefits and options, visit 
Join LDC<https://www.ldc.upenn.edu/members/join-ldc>.

30th Anniversary Highlight: Arabic Treebank
The Penn/LDC Arabic Treebank (ATB) project began in 2001 with support from the 
DARPA TIDES program and later, the DARPA GALE and BOLT programs. The original 
focus was on Modern Standard Arabic (MSA), which is not natively spoken and is not 
homogeneously acquired across its writing and reading community. In addition to 
the expected issues associated with complex data annotation, LDC encountered 
several challenges unique to a highly inflected language with a rich history of 
traditional grammar. LDC relied on traditional Arabic grammar, as well as 
established and modern grammatical theories of MSA -- in combination with the 
Penn Treebank approach to syntactic annotation -- to design an annotation 
system for Arabic (Maamouri et al., 
2004<https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/nemlar2004-penn-arabic-treebank.pdf>).
LDC was innovative with respect to traditional grammar when necessary and when 
other syntactic approaches were found to account for the data. LDC also 
developed a wide-coverage MSA morphological analyzer, LDC Standard Arabic 
Morphological Analyzer (SAMA) Version 3.1 
(LDC2010L01<https://catalog.ldc.upenn.edu/LDC2010L01>), which greatly benefited 
ATB development. Revisions to the annotation guidelines during the DARPA GALE 
program (principally related to tokenization and syntactic annotation) improved 
inter-annotator agreement and parsing scores.

ATB corpora were annotated for morphology, part-of-speech, gloss, and syntactic 
structure.  Data sets based on MSA newswire developed under the revised 
annotation guidelines include Arabic Treebank: Part 1 v 4.1 
(LDC2010T13<https://catalog.ldc.upenn.edu/LDC2010T13>), Arabic Treebank: Part 2 
v 3.1 (LDC2011T09<https://catalog.ldc.upenn.edu/LDC2011T09>), and Arabic 
Treebank: Part 3 v 3.2 (LDC2010T08<https://catalog.ldc.upenn.edu/LDC2010T08>). 
Other genres are represented in Arabic Treebank - Broadcast News v 1.0 
(LDC2012T07<https://catalog.ldc.upenn.edu/LDC2012T07>) and Arabic Treebank - 
Weblog (LDC2016T02<https://catalog.ldc.upenn.edu/LDC2016T02>).

LDC's later work on Egyptian Arabic treebanks in the DARPA BOLT program 
benefited from the strides in its MSA treebank annotation pipeline. As for the 
challenges presented by informal, dialectal material, collaborator Columbia 
University provided a normalized Arabic orthography to account for instances of 
Romanized script (Arabizi) in the data and developed a morphological analyzer 
(CALIMA) in parallel, working in a tight feedback loop with LDC's annotation 
team. SAMA and CALIMA were synchronized in the Egyptian Arabic treebanks, the 
former used for MSA tokens and the latter used for Egyptian Arabic tokens. 
Resulting corpora include BOLT Egyptian Arabic Treebank - Discussion Forum 
(LDC2018T23<https://catalog.ldc.upenn.edu/LDC2018T23>), Conversational 
Telephone Speech (LDC2021T12<https://catalog.ldc.upenn.edu/LDC2021T12>), and 
SMS/Chat (LDC2021T17<https://catalog.ldc.upenn.edu/LDC2021T17>).

ATB corpora and their related releases are available for licensing to LDC members 
and nonmembers. For more information about licensing LDC data, visit Obtaining 
Data<https://www.ldc.upenn.edu/language-resources/data/obtaining>.
________________________________
New publications:
2019 NIST Speaker Recognition Evaluation Test Set - 
Audio-Visual<https://catalog.ldc.upenn.edu/LDC2023V01> contains approximately 
64 hours of English audio-visual data for development and test, answer keys, 
enrollment, trial files, and documentation from the NIST-sponsored 2019 Speaker 
Recognition Evaluation 
(SRE)<https://www.nist.gov/itl/iad/mig/nist-2019-speaker-recognition-evaluation>.

The 2019 evaluation task was speaker detection, that is, to determine whether a 
specified target speaker was speaking during a segment of speech. The 
evaluation was conducted in two parts: (1) a leaderboard-style challenge based 
on conversational telephone speech and (2) a separate evaluation using 
audio-visual data. This release relates to the audio-visual evaluation.

The source audio-visual data was collected by LDC for the VAST (Video 
Annotation for Speech Technology) project. That collection focused on amateur 
video recordings from various online media hosting services. The recordings 
vary in duration from 17.5 seconds to 13 minutes; most have two audio channels 
(stereo), but some are monophonic (one channel).

2023 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.
*
LORELEI Tagalog Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2023T02> was developed by LDC and 
comprises approximately 4.8 million words of Tagalog monolingual text, 
341,000 words of found Tagalog-English parallel text, and 124,000 Tagalog words 
translated from English data. Approximately 78,000 words were annotated for 
named entities and over 26,000 words were annotated for entity discovery and 
linking and situation frames (identifying entities, needs, and issues). Data was 
collected from discussion forum, news, reference, social network, and weblog sources.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low resource languages in 
the context of emergent situations. Representative languages were selected to 
provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as 
LORELEI Entity Detection and Linking Knowledge Base 
(LDC2020T10)<https://catalog.ldc.upenn.edu/LDC2020T10>.

2023 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104




_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
