In this newsletter:
LDC at ACL 2023
LDC data and commercial technology development

New publications:
Moroccan Arabic - English Lexical 
Database<https://catalog.ldc.upenn.edu/LDC2023L01>
LORELEI Indonesian Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2023T07>
________________________________
LDC at ACL 2023
LDC will be exhibiting at ACL 2023, held this year July 9-14 in Toronto, 
Canada. Stop by our booth to learn more about recent developments at the 
Consortium and the latest publications. LDC will post conference updates via 
Twitter and Facebook. We look forward to seeing you there!

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite 
for obtaining a commercial license to almost all LDC databases. Non-member 
organizations, including non-member for-profit organizations, cannot use LDC 
data to develop or test products for commercialization, nor can they use LDC 
data in any commercial product or for any commercial purpose. LDC data users 
should consult corpus-specific license agreements for limitations on the use of 
certain corpora. Visit the 
Licensing<https://www.ldc.upenn.edu/data-management/using/licensing> page for 
further information.
________________________________
New publications:
Moroccan Arabic - English Lexical 
Database<https://catalog.ldc.upenn.edu/LDC2023L01> was developed by LDC. It 
contains a set of five interrelated tables presenting each Moroccan Arabic word 
as an orthographic form in Arabic script and a pronunciation form in 
International Phonetic Alphabet (IPA) format. This release contains over 21,000 
Moroccan Arabic words in Arabic script and IPA notation, and more than 33,000 
English tokens.

This lexical database is the result of a collaboration with Georgetown 
University Press (GUP) <https://press.georgetown.edu/> to enhance and update 
three dialectal Arabic dictionaries -- Iraqi, Moroccan, and Syrian -- 
originally published in paper form in the 1960s by GUP.  LDC also undertook to 
develop a lexical database for each dialect. The Georgetown Dictionary of 
Moroccan Arabic 
<https://press.georgetown.edu/Book/The-Georgetown-Dictionary-of-Moroccan-Arabic#:~:text=The%20Georgetown%20Dictionary%20of%20Moroccan%20Arabic%20is%20a%20modernized%20language,Press%20over%20fifty%20years%20ago>
 was published in 2019; this work was based on, and expanded, A Dictionary of 
Moroccan 
Arabic<https://press.georgetown.edu/Book/A-Dictionary-of-Moroccan-Arabic>.

The enhancements developed by LDC included facilitating comparisons across 
Arabic dialects and Modern Standard Arabic by providing Arabic script 
spellings and IPA pronunciations for Moroccan words and phrases; promoting ease 
of use for language learners and researchers by developing reasonable 
orthographic conventions for applying the Arabic alphabet to the dialect; and 
facilitating a user's understanding of morphological and lexical relations by 
adding information on the linguistic structures of Moroccan Arabic.

2023 members can access this corpus through their LDC accounts provided they 
have submitted a signed copy of the special license agreement. Non-members may 
license this data for a fee.
*
LORELEI Indonesian Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2023T07> contains over 17 million 
words of Indonesian monolingual text, 950,000 words of found 
Indonesian-English parallel text, and 92,000 Indonesian words translated from 
English data. Over 113,000 words were annotated for named entities and more 
than 24,000 words were annotated for entity discovery and linking and situation 
frames (identifying entities, needs, and issues). Data was collected from 
discussion forum, news, reference, social network, and weblog sources.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low resource languages in 
the context of emergent situations. Representative languages were selected to 
provide broad typological coverage.

The knowledge base for entity linking annotation is available separately as 
LORELEI Entity Detection and Linking Knowledge Base 
(LDC2020T10)<https://catalog.ldc.upenn.edu/LDC2020T10>.

2023 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104


