We are very happy to announce the twentieth release of annotated
treebanks in Universal Dependencies, v2.14, available at
https://universaldependencies.org/.

Universal Dependencies is a project that seeks to develop
cross-linguistically consistent treebank annotation for many languages
with the goal of facilitating multilingual parser development,
cross-lingual learning, and parsing research from a language typology
perspective (de Marneffe et al., 2021; Nivre et al., 2020). The
annotation scheme is based on (universal) Stanford dependencies (de
Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags
(Petrov et al., 2012), and the Interset interlingua for morphosyntactic
tagsets (Zeman, 2008). The general philosophy is to provide a universal
inventory of categories and guidelines to facilitate consistent
annotation of similar constructions across languages, while allowing
language-specific extensions when necessary.

The *283* treebanks in v2.14 are annotated according to version 2 of the
UD guidelines and represent the following *161* languages: Abaza,
Abkhaz, Afrikaans, Akkadian, Akuntsu, Albanian, Amharic, Ancient Greek,
Ancient Hebrew, Apurina, Arabic, Armenian, Assyrian, Azerbaijani,
Bambara, Basque, Bavarian, Beja, Belarusian, Bengali, Bhojpuri, Bororo,
Breton, Bulgarian, Buryat, Cantonese, Cappadocian, Catalan, Cebuano,
Chinese, Chukchi, Classical Armenian, Classical Chinese, Coptic,
Croatian, Czech, Danish, Dutch, Egyptian, English, Erzya, Estonian,
Faroese, Finnish, French, Frisian Dutch, Galician, Georgian, German,
Gheg, Gothic, Greek, Guajajara, Guarani, Gujarati, Haitian Creole,
Hausa, Hebrew, Highland Puebla Nahuatl, Hindi, Hittite, Hungarian,
Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kaapor,
Kangri, Karelian, Karo, Kazakh, Khunsari, Kiche, Komi Permyak, Komi
Zyrian, Korean, Kurmanji, Kyrgyz, Latgalian, Latin, Latvian, Ligurian,
Lithuanian, Livvi, Low Saxon, Luxembourgish, Macedonian, Madi, Maghrebi
Arabic French, Makurap, Malayalam, Maltese, Manx, Marathi, Mbya Guarani,
Middle French, Moksha, Munduruku, Naija, Nayini, Neapolitan, Nheengatu,
North Sami, Norwegian, Old Church Slavonic, Old East Slavic, Old French,
Old Irish, Old Turkish, Ottoman Turkish, Paumari, Persian, Polish,
Pomak, Portuguese, Romanian, Russian, Sanskrit, Scottish Gaelic,
Serbian, Sinhala, Skolt Sami, Slovak, Slovenian, Soi, South Levantine
Arabic, Spanish, Swedish, Swedish Sign Language, Swiss German, Tagalog,
Tamil, Tatar, Teko, Telugu, Telugu English, Thai, Tswana, Tupinamba,
Turkish, Turkish German, Ukrainian, Umbrian, Upper Sorbian, Urdu,
Uyghur, Veps, Vietnamese, Warlpiri, Welsh, Western Armenian, Western
Sierra Puebla Nahuatl, Wolof, Xavante, Xibe, Yakut, Yoruba, Yupik and
Zaar. The 161 languages belong to *31* families: Afro-Asiatic, Arawakan,
Arawan, Austro-Asiatic, Austronesian, Basque, Bororoan,
Chukotko-Kamchatkan, Code switching, Creole, Dravidian, Eskimo-Aleut,
Indo-European, Japanese, Kartvelian, Korean, Macro-Je, Mande, Mayan,
Mongolic, Niger-Congo, Northwest Caucasian, Pama-Nyungan, Sign Language,
Sino-Tibetan, Tai-Kadai, Tungusic, Tupian, Turkic, Uralic and
Uto-Aztecan. Depending on the language, the treebanks range in size from
fewer than 1,000 tokens to over 3 million tokens. We expect the next
release to be available in November 2024.

The size of the following 39 treebanks has changed significantly since
the last release:
Abkhaz AbNC : 0 → 2444
Azerbaijani TueCL : 0 → 656
Bavarian MaiBaam : 0 → 15024
Beja NSC : 1206 → 5888
Bororo BDT : 1905 → 6993
Cappadocian TueCL : 0 → 4118
Classical Armenian CAVaL : 13522 → 81996
Classical Chinese TueCL : 0 → 648
Dutch LassySmall : 98241 → 297486
Egyptian UJaen : 0 → 5515
English CTeTex : 0 → 9273
English GUM : 187522 → 212035
Galician PUD : 0 → 23510
Gujarati GujTB : 0 → 1885
Hausa NorthernAutogramm : 0 → 3919
Hausa SouthernAutogramm : 0 → 14585
Italian Old : 41367 → 82644
Kyrgyz TueCL : 0 → 1001
Latgalian Cairo : 0 → 173
Latin CIRCSE : 0 → 18968
Latvian Cairo : 0 → 171
Low Saxon LSDC : 4683 → 22639
Luxembourgish LuxBank : 0 → 206
Nheengatu CompLin : 12743 → 15036
Old East Slavic RNC : 48647 → 95551
Old Turkish Clausal : 0 → 158
Old Turkish Tonqq : 158 → 0
Ottoman Turkish BOUN : 0 → 8814
Ottoman Turkish DUDU : 0 → 813
Paumari TueCL : 0 → 504
Pomak Philotis : 86780 → 34348
Romanian TueCL : 0 → 4417
Sanskrit Vedic : 27117 → 206440
Slovenian SST : 29488 → 76341
Spanish COSER : 0 → 8073
Telugu English TECT : 0 → 456
Tswana Popapolelo : 0 → 214
Vietnamese TueCL : 0 → 1888
Zaar Autogramm : 7625 → 17682

In total, the new release contains *1,906,050* sentences, *31,541,523*
surface tokens and *32,179,731* syntactic words.

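The difference between surface tokens and syntactic words comes from
multiword tokens: in the CoNLL-U format used by UD treebanks, a surface
token such as Spanish "al" is listed with a range ID (3-4) and followed by
the syntactic words it contains (3 "a", 4 "el"). The Python sketch below
shows one way such counts could be obtained from CoNLL-U text. The function
name count_conllu, the helper row and the sample sentence are illustrative
assumptions, and the sketch is not necessarily how the official release
statistics were computed.

```python
def count_conllu(text):
    """Return (sentences, surface_tokens, syntactic_words) for CoNLL-U text."""
    sentences = tokens = words = 0
    in_sentence = False
    mwt_end = 0  # ID of the last word covered by the current multiword token
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            in_sentence = False
            mwt_end = 0
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tok_id = line.split("\t")[0]
        if "-" in tok_id:             # multiword token range, e.g. "1-2"
            tokens += 1
            mwt_end = int(tok_id.split("-")[1])
        elif "." in tok_id:           # empty node, e.g. "8.1": not counted
            continue
        else:                         # ordinary syntactic word
            words += 1
            if int(tok_id) > mwt_end:  # not part of a multiword token
                tokens += 1
    return sentences, tokens, words


def row(tok_id, form):
    # Hypothetical helper: a 10-column CoNLL-U line with unused fields as "_".
    return "\t".join([tok_id, form] + ["_"] * 8)


SAMPLE = "\n".join([
    "# text = vámonos al mar",
    row("1-2", "vámonos"), row("1", "vamos"), row("2", "nos"),
    row("3-4", "al"), row("3", "a"), row("4", "el"),
    row("5", "mar"),
    "",
])

print(count_conllu(SAMPLE))  # (1, 3, 5)
```

In the sample, three surface tokens (vámonos, al, mar) correspond to five
syntactic words, which mirrors why the word count above exceeds the token
count.
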
Daniel Zeman, Joakim Nivre, Mitchell Abrams, Elia Ackermann, Noëmi
Aepli, Hamid Aghaei, Željko Agić, Amir Ahmadi, Lars Ahrenberg, Chika
Kennedy Ajede, Salih Furkan Akkurt, Gabrielė Aleksandravičiūtė, Ika
Alfina, Avner Algom, Khalid Alnajjar, Chiara Alzetta, Erik Andersen,
Lene Antonsen, Tatsuya Aoyama, Katya Aplonova, Angelina Aquino, Carolina
Aragon, Glyd Aranes, Maria Jesus Aranzabe, Bilge Nas Arıcan, Þórunn
Arnardóttir, Gashaw Arutie, Jessica Naraiswari Arwidarasti, Masayuki
Asahara, Katla Ásgeirsdóttir, Deniz Baran Aslan, Cengiz Asmazoğlu, Luma
Ateyah, Furkan Atmaca, Mohammed Attia, Aitziber Atutxa, Liesbeth
Augustinus, Mariana Avelãs, Elena Badmaeva, Keerthana Balasubramani,
Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu
Mititelu, Starkaður Barkarson, Rodolfo Basile, Victoria Basmov, Colin
Batchelor, John Bauer, Seyyit Talha Bedir, Shabnam Behzad, Juan Belieni,
Kepa Bengoetxea, İbrahim Benli, Yifat Ben Moshe, Ansu Berg, Gözde Berk,
Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė Bielinskienė, Esma
Fatıma Bilgin Taşdemir, Kristín Bjarnadóttir, Verena Blaschke, Rogier
Blokland, Victoria Bobicev, Loïc Boizou, Johnatan Bonilla, Emanuel
Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman,
Adriane Boyd, Anouck Braggaar, António Branco, Kristina Brokaitė,
Aljoscha Burchardt, Marisa Campos, Marie Candito, Bernard Caron,
Gauthier Caron, Catarina Carvalheiro, Rita Carvalho, Lauren Cassidy,
Maria Clara Castro, Sérgio Castro, Tatiana Cavalcanti, Gülşen Cebiroğlu
Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír
Čéplö, Neslihan Cesur, Savas Cetin, Özlem Çetinoğlu, Fabricio Chalub,
Liyanage Chamila, Shweta Chauhan, Yifei Chen, Ethan Chi, Taishi Chika,
Yongseok Cho, Jinho Choi, Bermet Chontaeva, Jayeol Chun, Juyeon Chung,
Alessandra T. Cignarella, Silvie Cinková, Aurélie Collomb, Çağrı
Çöltekin, Miriam Connor, Claudia Corbetta, Daniela Corbetta, Francisco
Costa, Marine Courtin, Benoît Crabbé, Mihaela Cristescu, Vladimir
Cvetkoski, Ingerid Løyning Dale, Philemon Daniel, Elizabeth Davidson,
Leonel Figueiredo de Alencar, Mathieu Dehouck, Martina de Laurentiis,
Marie-Catherine de Marneffe, Valeria de Paiva, Mehmet Oguz Derin, Elvis
de Souza, Arantza Diaz de Ilarraza, Roberto Antonio Díaz Hernández,
Carly Dickerson, Arawinda Dinakaramani, Elisa Di Nuovo, Bamba Dione,
Peter Dirix, Hoa Do, Kaja Dobrovoljc, Caroline Döhmer, Adrian Doyle,
Timothy Dozat, Kira Droganova, Magali Sanches Duran, Puneet Dwivedi,
Christian Ebert, Hanne Eckhoff, Masaki Eguchi, Sandra Eiche, Roald
Eiselen, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Olga Erina, Tomaž
Erjavec, Soudabeh Eslami, Farah Essaidi, Aline Etienne, Wograine Evelyn,
Sidney Facundes, Richárd Farkas, Federica Favero, Jannatul Ferdaousi,
Marília Fernanda, Hector Fernandez Alcalde, Amal Fethi, Jennifer Foster,
Theodorus Fransen, Cláudia Freitas, Kazunori Fujita, Katarína Gajdošová,
Daniel Galbraith, Edith Galy, Federica Gamba, Marcos Garcia, Moa
Gärdenfors, Tanja Gaustad, Efe Eren Genç, Fabrício Ferraz Gerardi, Kim
Gerdes, Luke Gessler, Filip Ginter, Gustavo Godoy, Iakes Goenaga, Koldo
Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta
González Saavedra, Bernadeta Griciūtė, Matias Grioni, Loïc Grobol,
Normunds Grūzītis, Bruno Guillaume, Kirian Guiller, Céline
Guillot-Barbance, Tunga Güngör, Nizar Habash, Hinrik Hafsteinsson, Jan
Hajič, Jan Hajič jr., Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Muhammad
Yudistira Hanifmuti, Takahiro Harada, Sam Hardwick, Kim Harris, Naïma
Hassert, Dag Haug, Johannes Heinecke, Oliver Hellwig, Felix Hennig,
Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Diana Hoefels,
Petter Hohle, Yidi Huang, Marivel Huerta Mendez, Jena Hwang, Takumi
Ikeda, Inessa Iliadou, Anton Karl Ingason, Radu Ion, Elena Irimia,
Ọlájídé Ishola, Artan Islamaj, Kaoru Ito, Federica Iurescia, Sandra
Jagodzińska, Siratun Jannat, Tomáš Jelínek, Apoorva Jha, Katharine
Jiang, Mayank Jobanputra, Anders Johannsen, Hildur Jónsdóttir, Fredrik
Jørgensen, Markus Juutinen, Hüner Kaşıkara, Nadezhda Kabaeva, Sylvain
Kahane, Hiroshi Kanayama, Jenna Kanerva, Neslihan Kara, Ritván Karahóǧa,
Andre Kåsen, Tolga Kayadelen, Sarveswaran Kengatharaiyer, Václava
Kettnerová, Lilit Kharatyan, Jesse Kirchner, Elena Klementieva, Elena
Klyachko, Petr Kocharov, Arne Köhn, Abdullatif Köksal, Kamil Kopacewicz,
Timo Korkiakangas, Mehmet Köse, Alexey Koshevoy, Natalia Kotsyba,
Barbara Kovačić, Jolanta Kovalevskaitė, Simon Krek, Parameswari
Krishnamurthy, Sandra Kübler, Adrian Kuqi, Oğuzhan Kuyrukçu, Aslı
Kuzgun, Sookyoung Kwak, Kris Kyle, Käbi Laan, Veronika Laippala, Lorenzo
Lambertino, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev,
John Lee, Phương Lê Hồng, Alessandro Lenci, Saran Lertpradit, Herman
Leung, Maria Levina, Lauren Levine, Cheuk Ying Li, Josie Li, Keying Li,
Yixuan Li, Yuan Li, KyungTae Lim, Bruna Lima Padovani, Yi-Ju Jessica
Lin, Krister Lindén, Yang Janet Liu, Nikola Ljubešić, Irina Lobzhanidze,
Olga Loginova, Lucelene Lopes, Stefano Lusito, Anne-Marie Lutgen, Andry
Luthfi, Mikko Luukko, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz,
Menel Mahamdi, Jean Maillard, Ilya Makarchuk, Aibek Makazhanov,
Francesco Mambrini, Michael Mandl, Christopher Manning, Ruli Manurung,
Büşra Marşan, Cătălina Mărănduc, David Mareček, Katrin Marheinecke,
Stella Markantonatou, Héctor Martínez Alonso, Lorena Martín Rodríguez,
André Martins, Cláudia Martins, Jan Mašek, Hiroshi Matsuda, Yuji
Matsumoto, Alessandro Mazzei, Ryan McDonald, Sarah McGuinness, Maitrey
Mehta, Pierre André Ménard, Gustavo Mendonça, Tatiana Merzhevich, Paul
Meurer, Niko Miekka, Emilia Milano, Aaron Miller, Karina Mischenkova,
Anna Missilä, Cătălin Mititelu, Maria Mitrofan, Yusuke Miyao,
AmirHossein Mojiri Foroushani, Judit Molnár, Amirsaeid Moloodi,
Simonetta Montemagni, Amir More, Laura Moreno Romero, Giovanni Moretti,
Shinsuke Mori, Tomohiko Morioka, Shigeki Moro, Bjartur Mortensen, Bohdan
Moskalevskyi, Kadri Muischnek, Robert Munro, Yugo Murawaki, Kaili
Müürisep, Pinkey Nainwani, Mariam Nakhlé, Juan Ignacio Navarro
Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Manuela Nevaci,
Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Yoshihiro Nikaido, Vitaly
Nikolaev, Rattima Nitisaroj, Victor Norrman, Alireza Nourian, Maria das
Graças Volpe Nunes, Hanna Nurmi, Stina Ojala, Atul Kr. Ojha, Hulda
Óladóttir, Adédayọ̀ Olúòkun, Mai Omura, Emeka Onwuegbuzia, Noam Ordan,
Petya Osenova, Robert Östling, Annika Ott, Lilja Øvrelid, Şaziye Betül
Özateş, Merve Özçelik, Arzucan Özgür, Balkız Öztürk Başaran, Teresa
Paccosi, Alessio Palmero Aprosio, Anastasia Panova, Thiago Alexandre
Salgueiro Pardo, Hyunji Hayley Park, Niko Partanen, Elena Pascual, Marco
Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Giulia
Pedonese, Angelika Peljak-Łapińska, Siyao Peng, Siyao Logan Peng, Rita
Pereira, Sílvia Pereira, Cenel-Augusto Perez, Natalia Perkova, Guy
Perrier, Slav Petrov, Daria Petrova, Andrea Peverelli, Jason Phelan,
Claudel Pierre-Louis, Jussi Piitulainen, Yuval Pinter, Clara Pinto,
Rodrigo Pintucci, Tommi A Pirinen, Emily Pitler, Magdalena Plamada,
Barbara Plank, Alistair Plum, Thierry Poibeau, Larisa Ponomareva, Martin
Popel, Lauma Pretkalniņa, Rigardt Pretorius, Sophie Prévost, Prokopis
Prokopidis, Adam Przepiórkowski, Robert Pugh, Tiina Puolakainen,
Christoph Purschke, Sampo Pyysalo, Peng Qi, Andreia Querido, Andriela
Rääbis, Alexandre Rademaker, Mizanur Rahoman, Taraka Rama, Loganathan
Ramasamy, Carlos Ramisch, Joana Ramos, Fam Rashel, Mohammad Sadegh
Rasooli, Vinit Ravishankar, Livy Real, Petru Rebeja, Siva Reddy,
Mathilde Regnault, Georg Rehm, Arij Riabi, Ivan Riabov, Michael Rießler,
Erika Rimkutė, Larissa Rinaldi, Laura Rituma, Putri Rizqiyah, Luisa
Rocha, Eiríkur Rögnvaldsson, Ivan Roksandic, Mykhailo Romanenko, Rudolf
Rosa, Valentin Roșca, Davide Rovati, Ben Rozonoyer, Olga Rudina, Jack
Rueter, Paolo Ruffolo, Kristján Rúnarsson, Shoval Sadde, Pegah Safari,
Aleksi Sahala, Shadi Saleh, Alessio Salomoni, Tanja Samardžić, Stephanie
Samson, Xulia Sánchez-Rodríguez, Manuela Sanguinetti, Ezgi Sanıyar, Dage
Särg, Marta Sartor, Albina Sarymsakova, Mitsuya Sasaki, Baiba Saulīte,
Agata Savary, Yanin Sawanakunanon, Shefali Saxena, Kevin Scannell,
Salvatore Scarlata, Emmanuel Schang, Nathan Schneider, Sebastian
Schuster, Lane Schwartz, Djamé Seddah, Wolfgang Seeker, Sven Sellmer,
Mojgan Seraji, Syeda Shahzadi, Mo Shen, Atsuko Shimada, Hiroyuki
Shirasu, Yana Shishkina, Muh Shohibussirri, Maria Shvedova, Janine
Siewert, Einar Freyr Sigurðsson, João Silva, Aline Silveira, Natalia
Silveira, Sara Silveira, Maria Simi, Radu Simionescu, Katalin Simkó,
Mária Šimková, Haukur Barri Símonarson, Kiril Simov, Dmitri Sitchinava,
Ted Sither, Aaron Smith, Isabela Soares-Bastos, Per Erik Solberg,
Barbara Sonnenhauser, Shafi Sourov, Rachele Sprugnoli, Vivian Stamou,
Steinþór Steingrímsson, Antonio Stella, Abishek Stephen, Milan Straka,
Emmett Strickland, Jana Strnadová, Alane Suhr, Yogi Lesmana Sulestio,
Umut Sulubacak, Shingo Suzuki, Daniel Swanson, Zsolt Szántó, Chihiro
Taguchi, Dima Taji, Fabio Tamburini, Mary Ann C. Tan, Takaaki Tanaka,
Dipta Tanaya, Mirko Tavoni, Samson Tella, Isabelle Tellier, Marinella
Testori, Guillaume Thomas, Tarık Emre Tıraş, Sara Tonelli, Liisi Torga,
Marsida Toska, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Utku Türk,
Francis Tyers, Sveinbjörn Þórðarson, Vilhjálmur Þorsteinsson, Sumire
Uematsu, Roman Untilov, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit,
Andrius Utka, Elena Vagnoni, Sowmya Vajjala, Socrates Vak, Rob van der
Goot, Martine Vanhove, Daniel van Niekerk, Gertjan van Noord, Viktor
Varga, Uliana Vedenina, Giulia Venturi, Eric Villemonte de la Clergerie,
Veronika Vincze, Anishka Vissamsetty, Natalia Vlasova, Eleni
Vligouridou, Aya Wakasa, Joel C. Wallenberg, Lars Wallin, Abigail Walsh,
John Wang, Jonathan North Washington, Maximilan Wendt, Paul Widmer,
Shira Wigderson, Sri Hartati Wijono, Vanessa Berwanger Wille, Seyi
Williams, Mats Wirén, Christian Wittern, Tsegay Woldemariam, Tak-sum
Wong, Alina Wróblewska, Qishen Wu, Mary Yako, Kayo Yamashita, Naoki
Yamazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Arife Betül
Yenice, Enes Yılandiloğlu, Olcay Taner Yıldız, Zhuoran Yu, Arlisa
Yuliawati, Zdeněk Žabokrtský, Shorouq Zahra, Amir Zeldes, He Zhou,
Hanzhi Zhu, Yilun Zhu, Anna Zhuravleva, Rayan Ziane

References
Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, Daniel
Zeman. 2021. Universal Dependencies. In Computational Linguistics 47:2,
pp. 255–308.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič,
Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis
Tyers, Daniel Zeman. 2020. Universal Dependencies v2: An Evergrowing
Multilingual Treebank Collection. In Proceedings of LREC.
--------------------------------------------------------------------------------
Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D.
Manning. 2006. Generating typed dependency parses from phrase structure
parses. In Proceedings of LREC.
Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The
Stanford typed dependencies representation. In COLING Workshop on
Cross-framework and Cross-domain Parser Evaluation.
Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri
Haverinen, Filip Ginter, Joakim Nivre, and Christopher Manning. 2014.
Universal Stanford Dependencies: A cross-linguistic typology. In
Proceedings of LREC.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg,
Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo
Pyysalo, Natalia Silveira, Reut Tsarfaty, Daniel Zeman. 2016. Universal
Dependencies v1: A Multilingual Treebank Collection. In Proceedings of LREC.
Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal
part-of-speech tagset. In Proceedings of LREC.
Daniel Zeman. 2008. Reusable Tagset Conversion Using Tagset Drivers. In
Proceedings of LREC.