We are very happy to announce the nineteenth release of annotated
treebanks in Universal Dependencies, v2.13, available at
http://universaldependencies.org/.

Universal Dependencies is a project that seeks to develop
cross-linguistically consistent treebank annotation for many languages
with the goal of facilitating multilingual parser development,
cross-lingual learning, and parsing research from a language typology
perspective (de Marneffe et al., 2021; Nivre et al., 2020). The
annotation scheme is based on (universal) Stanford dependencies (de
Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags
(Petrov et al., 2012), and the Interset interlingua for morphosyntactic
tagsets (Zeman, 2008). The general philosophy is to provide a universal
inventory of categories and guidelines to facilitate consistent
annotation of similar constructions across languages, while allowing
language-specific extensions when necessary.

The *259* treebanks in v2.13 are annotated according to version 2 of the
UD guidelines and represent the following *148 languages:* Abaza,
Afrikaans, Akkadian, Akuntsu, Albanian, Amharic, Ancient Greek, Ancient
Hebrew, Apurina, Arabic, Armenian, Assyrian, Bambara, Basque, Beja,
Belarusian, Bengali, Bhojpuri, Bororo, Breton, Bulgarian, Buryat,
Cantonese, Catalan, Cebuano, Chinese, Chukchi, Classical Armenian,
Classical Chinese, Coptic, Croatian, Czech, Danish, Dutch, English,
Erzya, Estonian, Faroese, Finnish, French, Frisian Dutch, Galician,
Georgian, German, Gheg, Gothic, Greek, Guajajara, Guarani, Haitian
Creole, Hebrew, Highland Puebla Nahuatl, Hindi, Hittite, Hungarian,
Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kaapor,
Kangri, Karelian, Karo, Kazakh, Khunsari, Kiche, Komi Permyak, Komi
Zyrian, Korean, Kurmanji, Kyrgyz, Latin, Latvian, Ligurian, Lithuanian,
Livvi, Low Saxon, Macedonian, Madi, Maghrebi Arabic French, Makurap,
Malayalam, Maltese, Manx, Marathi, Mbya Guarani, Middle French, Moksha,
Munduruku, Naija, Nayini, Neapolitan, Nheengatu, North Sami, Norwegian,
Old Church Slavonic, Old East Slavic, Old French, Old Irish, Old
Turkish, Persian, Polish, Pomak, Portuguese, Romanian, Russian,
Sanskrit, Scottish Gaelic, Serbian, Sinhala, Skolt Sami, Slovak,
Slovenian, Soi, South Levantine Arabic, Spanish, Swedish, Swedish Sign
Language, Swiss German, Tagalog, Tamil, Tatar, Teko, Telugu, Thai,
Tupinamba, Turkish, Turkish German, Ukrainian, Umbrian, Upper Sorbian,
Urdu, Uyghur, Veps, Vietnamese, Warlpiri, Welsh, Western Armenian,
Western Sierra Puebla Nahuatl, Wolof, Xavante, Xibe, Yakut, Yoruba,
Yupik and Zaar. The 148 languages belong to *31 families:* Afro-Asiatic,
Arawakan, Arawan, Austro-Asiatic, Austronesian, Basque, Bororoan,
Chukotko-Kamchatkan, Code switching, Creole, Dravidian, Eskimo-Aleut,
Indo-European, Japanese, Kartvelian, Korean, Macro-Je, Mande, Mayan,
Mongolic, Niger-Congo, Northwest Caucasian, Pama-Nyungan, Sign Language,
Sino-Tibetan, Tai-Kadai, Tungusic, Tupian, Turkic, Uralic and
Uto-Aztecan. Depending on the language, the treebanks range in size from
fewer than 1,000 tokens to over 3 million tokens. We expect the next
release to be available in May 2024.

The size of the following 23 treebanks has changed significantly since
the last release:
Ancient Greek PTNK           :      0 → 39509
Beja NSC                     :    857 → 1206
Bororo BDT                   :    692 → 1905
Chinese Beginner             :      0 → 19999
Chinese PatentChar           :   2160 → 4784
Classical Armenian CAVaL     :      0 → 13522
Czech Poetry                 :      0 → 6288
Georgian GLC                 :      0 → 2335
Haitian Creole Autogramm     :      0 → 3278
Highland Puebla Nahuatl ITML :      0 → 10103
Italian Old                  :      0 → 41367
Low Saxon LSDC               :   2935 → 4683
Macedonian MTB               :      0 → 1360
Middle French PROFITEROLE    :      0 → 12025
Nheengatu CompLin            :   8604 → 12743
Old East Slavic Ruthenian    :  10011 → 96803
Old French SRCMF             : 199699 → 0
Old French PROFITEROLE       :      0 → 227137
Portuguese GSD               :      0 → 318666
Portuguese Porttinari        :      0 → 168080
Russian Poetry               :      0 → 64112
Teko TuDeT                   :   2272 → 2896
Veps VWT                     :      0 → 1303

Daniel Zeman, Joakim Nivre, Mitchell Abrams, Elia Ackermann, Noëmi
Aepli, Hamid Aghaei, Željko Agić, Amir Ahmadi, Lars Ahrenberg, Chika
Kennedy Ajede, Salih Furkan Akkurt, Gabrielė Aleksandravičiūtė, Ika
Alfina, Avner Algom, Khalid Alnajjar, Chiara Alzetta, Erik Andersen,
Lene Antonsen, Tatsuya Aoyama, Katya Aplonova, Angelina Aquino, Carolina
Aragon, Glyd Aranes, Maria Jesus Aranzabe, Bilge Nas Arıcan, Þórunn
Arnardóttir, Gashaw Arutie, Jessica Naraiswari Arwidarasti, Masayuki
Asahara, Katla Ásgeirsdóttir, Deniz Baran Aslan, Cengiz Asmazoğlu, Luma
Ateyah, Furkan Atmaca, Mohammed Attia, Aitziber Atutxa, Liesbeth
Augustinus, Mariana Avelãs, Elena Badmaeva, Keerthana Balasubramani,
Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu
Mititelu, Starkaður Barkarson, Rodolfo Basile, Victoria Basmov, Colin
Batchelor, John Bauer, Seyyit Talha Bedir, Shabnam Behzad, Juan Belieni,
Kepa Bengoetxea, İbrahim Benli, Yifat Ben Moshe, Gözde Berk, Riyaz Ahmad
Bhat, Erica Biagetti, Eckhard Bick, Agnė Bielinskienė, Kristín
Bjarnadóttir, Rogier Blokland, Victoria Bobicev, Loïc Boizou, Emanuel
Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman,
Adriane Boyd, Anouck Braggaar, António Branco, Kristina Brokaitė,
Aljoscha Burchardt, Marisa Campos, Marie Candito, Bernard Caron,
Gauthier Caron, Catarina Carvalheiro, Rita Carvalho, Lauren Cassidy,
Maria Clara Castro, Sérgio Castro, Tatiana Cavalcanti, Gülşen Cebiroğlu
Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír
Čéplö, Neslihan Cesur, Savas Cetin, Özlem Çetinoğlu, Fabricio Chalub,
Liyanage Chamila, Shweta Chauhan, Ethan Chi, Taishi Chika, Yongseok Cho,
Jinho Choi, Jayeol Chun, Juyeon Chung, Alessandra T. Cignarella, Silvie
Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Claudia
Corbetta, Daniela Corbetta, Francisco Costa, Marine Courtin, Benoît
Crabbé, Mihaela Cristescu, Vladimir Cvetkoski, Ingerid Løyning Dale,
Philemon Daniel, Elizabeth Davidson, Leonel Figueiredo de Alencar,
Mathieu Dehouck, Martina de Laurentiis, Marie-Catherine de Marneffe,
Valeria de Paiva, Mehmet Oguz Derin, Elvis de Souza, Arantza Diaz de
Ilarraza, Carly Dickerson, Arawinda Dinakaramani, Elisa Di Nuovo, Bamba
Dione, Peter Dirix, Kaja Dobrovoljc, Adrian Doyle, Timothy Dozat, Kira
Droganova, Magali Sanches Duran, Puneet Dwivedi, Christian Ebert, Hanne
Eckhoff, Masaki Eguchi, Sandra Eiche, Marhaba Eli, Ali Elkahky, Binyam
Ephrem, Olga Erina, Tomaž Erjavec, Farah Essaidi, Aline Etienne,
Wograine Evelyn, Sidney Facundes, Richárd Farkas, Federica Favero,
Jannatul Ferdaousi, Marília Fernanda, Hector Fernandez Alcalde, Amal
Fethi, Jennifer Foster, Theodorus Fransen, Cláudia Freitas, Kazunori
Fujita, Katarína Gajdošová, Daniel Galbraith, Federica Gamba, Marcos
Garcia, Moa Gärdenfors, Fabrício Ferraz Gerardi, Kim Gerdes, Luke
Gessler, Filip Ginter, Gustavo Godoy, Iakes Goenaga, Koldo Gojenola,
Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta González
Saavedra, Bernadeta Griciūtė, Matias Grioni, Loïc Grobol, Normunds
Grūzītis, Bruno Guillaume, Kirian Guiller, Céline Guillot-Barbance,
Tunga Güngör, Nizar Habash, Hinrik Hafsteinsson, Jan Hajič, Jan Hajič
jr., Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Muhammad Yudistira
Hanifmuti, Takahiro Harada, Sam Hardwick, Kim Harris, Dag Haug, Johannes
Heinecke, Oliver Hellwig, Felix Hennig, Barbora Hladká, Jaroslava
Hlaváčová, Florinel Hociung, Petter Hohle, Yidi Huang, Marivel Huerta
Mendez, Jena Hwang, Takumi Ikeda, Anton Karl Ingason, Radu Ion, Elena
Irimia, Ọlájídé Ishola, Artan Islamaj, Kaoru Ito, Sandra Jagodzińska,
Siratun Jannat, Tomáš Jelínek, Apoorva Jha, Katharine Jiang, Anders
Johannsen, Hildur Jónsdóttir, Fredrik Jørgensen, Markus Juutinen, Hüner
Kaşıkara, Nadezhda Kabaeva, Sylvain Kahane, Hiroshi Kanayama, Jenna
Kanerva, Neslihan Kara, Ritván Karahóǧa, Andre Kåsen, Tolga Kayadelen,
Sarveswaran Kengatharaiyer, Václava Kettnerová, Lilit Kharatyan, Jesse
Kirchner, Elena Klementieva, Elena Klyachko, Petr Kocharov, Arne Köhn,
Abdullatif Köksal, Kamil Kopacewicz, Timo Korkiakangas, Mehmet Köse,
Alexey Koshevoy, Natalia Kotsyba, Jolanta Kovalevskaitė, Simon Krek,
Parameswari Krishnamurthy, Sandra Kübler, Adrian Kuqi, Oğuzhan Kuyrukçu,
Aslı Kuzgun, Sookyoung Kwak, Kris Kyle, Käbi Laan, Veronika Laippala,
Lorenzo Lambertino, Tatiana Lando, Septina Dian Larasati, Alexei
Lavrentiev, John Lee, Phương Lê Hồng, Alessandro Lenci, Saran
Lertpradit, Herman Leung, Maria Levina, Lauren Levine, Cheuk Ying Li,
Josie Li, Keying Li, Yixuan Li, Yuan Li, KyungTae Lim, Bruna Lima
Padovani, Yi-Ju Jessica Lin, Krister Lindén, Yang Janet Liu, Nikola
Ljubešić, Irina Lobzhanidze, Olga Loginova, Lucelene Lopes, Stefano
Lusito, Andry Luthfi, Mikko Luukko, Olga Lyashevskaya, Teresa Lynn,
Vivien Macketanz, Menel Mahamdi, Jean Maillard, Ilya Makarchuk, Aibek
Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, Büşra
Marşan, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Stella
Markantonatou, Héctor Martínez Alonso, Lorena Martín Rodríguez, André
Martins, Cláudia Martins, Jan Mašek, Hiroshi Matsuda, Yuji Matsumoto,
Alessandro Mazzei, Ryan McDonald, Sarah McGuinness, Gustavo Mendonça,
Tatiana Merzhevich, Niko Miekka, Aaron Miller, Karina Mischenkova, Anna
Missilä, Cătălin Mititelu, Maria Mitrofan, Yusuke Miyao, AmirHossein
Mojiri Foroushani, Judit Molnár, Amirsaeid Moloodi, Simonetta
Montemagni, Amir More, Laura Moreno Romero, Giovanni Moretti, Shinsuke
Mori, Tomohiko Morioka, Shigeki Moro, Bjartur Mortensen, Bohdan
Moskalevskyi, Kadri Muischnek, Robert Munro, Yugo Murawaki, Kaili
Müürisep, Pinkey Nainwani, Mariam Nakhlé, Juan Ignacio Navarro
Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Manuela Nevaci,
Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Yoshihiro Nikaido, Vitaly
Nikolaev, Rattima Nitisaroj, Alireza Nourian, Maria das Graças Volpe
Nunes, Hanna Nurmi, Stina Ojala, Atul Kr. Ojha, Hulda Óladóttir,
Adédayọ̀ Olúòkun, Mai Omura, Emeka Onwuegbuzia, Noam Ordan, Petya
Osenova, Robert Östling, Lilja Øvrelid, Şaziye Betül Özateş, Merve
Özçelik, Arzucan Özgür, Balkız Öztürk Başaran, Teresa Paccosi, Alessio
Palmero Aprosio, Anastasia Panova, Thiago Alexandre Salgueiro Pardo,
Hyunji Hayley Park, Niko Partanen, Elena Pascual, Marco Passarotti,
Agnieszka Patejuk, Guilherme Paulino-Passos, Giulia Pedonese, Angelika
Peljak-Łapińska, Siyao Peng, Siyao Logan Peng, Rita Pereira, Sílvia
Pereira, Cenel-Augusto Perez, Natalia Perkova, Guy Perrier, Slav Petrov,
Daria Petrova, Andrea Peverelli, Jason Phelan, Claudel Pierre-Louis,
Jussi Piitulainen, Yuval Pinter, Clara Pinto, Rodrigo Pintucci, Tommi A
Pirinen, Emily Pitler, Magdalena Plamada, Barbara Plank, Thierry
Poibeau, Larisa Ponomareva, Martin Popel, Lauma Pretkalniņa, Sophie
Prévost, Prokopis Prokopidis, Adam Przepiórkowski, Robert Pugh, Tiina
Puolakainen, Sampo Pyysalo, Peng Qi, Andreia Querido, Andriela Rääbis,
Alexandre Rademaker, Mizanur Rahoman, Taraka Rama, Loganathan Ramasamy,
Carlos Ramisch, Joana Ramos, Fam Rashel, Mohammad Sadegh Rasooli, Vinit
Ravishankar, Livy Real, Petru Rebeja, Siva Reddy, Mathilde Regnault,
Georg Rehm, Arij Riabi, Ivan Riabov, Michael Rießler, Erika Rimkutė,
Larissa Rinaldi, Laura Rituma, Putri Rizqiyah, Luisa Rocha, Eiríkur
Rögnvaldsson, Ivan Roksandic, Mykhailo Romanenko, Rudolf Rosa, Valentin
Roșca, Davide Rovati, Ben Rozonoyer, Olga Rudina, Jack Rueter, Kristján
Rúnarsson, Shoval Sadde, Pegah Safari, Aleksi Sahala, Shadi Saleh,
Alessio Salomoni, Tanja Samardžić, Stephanie Samson, Manuela
Sanguinetti, Ezgi Sanıyar, Dage Särg, Marta Sartor, Mitsuya Sasaki,
Baiba Saulīte, Agata Savary, Yanin Sawanakunanon, Shefali Saxena, Kevin
Scannell, Salvatore Scarlata, Emmanuel Schang, Nathan Schneider,
Sebastian Schuster, Lane Schwartz, Djamé Seddah, Wolfgang Seeker, Mojgan
Seraji, Syeda Shahzadi, Mo Shen, Atsuko Shimada, Hiroyuki Shirasu, Yana
Shishkina, Muh Shohibussirri, Maria Shvedova, Janine Siewert, Einar
Freyr Sigurðsson, João Silva, Aline Silveira, Natalia Silveira, Sara
Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková,
Haukur Barri Símonarson, Kiril Simov, Dmitri Sitchinava, Ted Sither,
Maria Skachedubova, Aaron Smith, Isabela Soares-Bastos, Per Erik
Solberg, Barbara Sonnenhauser, Shafi Sourov, Rachele Sprugnoli, Vivian
Stamou, Steinþór Steingrímsson, Antonio Stella, Abishek Stephen, Milan
Straka, Emmett Strickland, Jana Strnadová, Alane Suhr, Yogi Lesmana
Sulestio, Umut Sulubacak, Shingo Suzuki, Daniel Swanson, Zsolt Szántó,
Chihiro Taguchi, Dima Taji, Fabio Tamburini, Mary Ann C. Tan, Takaaki
Tanaka, Dipta Tanaya, Mirko Tavoni, Samson Tella, Isabelle Tellier,
Marinella Testori, Guillaume Thomas, Sara Tonelli, Liisi Torga, Marsida
Toska, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Utku Türk, Francis
Tyers, Sveinbjörn Þórðarson, Vilhjálmur Þorsteinsson, Sumire Uematsu,
Roman Untilov, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit, Andrius
Utka, Elena Vagnoni, Sowmya Vajjala, Socrates Vak, Rob van der Goot,
Martine Vanhove, Daniel van Niekerk, Gertjan van Noord, Viktor Varga,
Uliana Vedenina, Giulia Venturi, Eric Villemonte de la Clergerie,
Veronika Vincze, Natalia Vlasova, Aya Wakasa, Joel C. Wallenberg, Lars
Wallin, Abigail Walsh, Jonathan North Washington, Maximilan Wendt, Paul
Widmer, Shira Wigderson, Sri Hartati Wijono, Vanessa Berwanger Wille,
Seyi Williams, Mats Wirén, Christian Wittern, Tsegay Woldemariam,
Tak-sum Wong, Alina Wróblewska, Qishen Wu, Mary Yako, Kayo Yamashita,
Naoki Yamazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Arife
Betül Yenice, Olcay Taner Yıldız, Zhuoran Yu, Arlisa Yuliawati, Zdeněk
Žabokrtský, Shorouq Zahra, Amir Zeldes, He Zhou, Hanzhi Zhu, Yilun Zhu,
Anna Zhuravleva, Rayan Ziane

References
Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, and
Daniel Zeman. 2021. Universal Dependencies. In Computational Linguistics
47:2, pp. 255–308.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič,
Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis
Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An Evergrowing
Multilingual Treebank Collection. In Proceedings of LREC.
--------------------------------------------------------------------------------
Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D.
Manning. 2006. Generating typed dependency parses from phrase structure
parses. In Proceedings of LREC.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The
Stanford typed dependencies representation. In COLING Workshop on
Cross-framework and Cross-domain Parser Evaluation.

Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri
Haverinen, Filip Ginter, Joakim Nivre, and Christopher Manning. 2014.
Universal Stanford Dependencies: A cross-linguistic typology. In
Proceedings of LREC.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg,
Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo
Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016.
Universal Dependencies v1: A Multilingual Treebank Collection. In
Proceedings of LREC.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal
part-of-speech tagset. In Proceedings of LREC.

Daniel Zeman. 2008. Reusable Tagset Conversion Using Tagset Drivers. In
Proceedings of LREC.