We are very happy to announce the twenty-first release of annotated
treebanks in Universal Dependencies, v2.15, available at
https://universaldependencies.org/.

Universal Dependencies is a project that seeks to develop
cross-linguistically consistent treebank annotation for many languages
with the goal of facilitating multilingual parser development,
cross-lingual learning, and parsing research from a language typology
perspective (de Marneffe et al., 2021; Nivre et al., 2020). The
annotation scheme is based on (universal) Stanford dependencies (de
Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags
(Petrov et al., 2012), and the Interset interlingua for morphosyntactic
tagsets (Zeman, 2008). The general philosophy is to provide a universal
inventory of categories and guidelines to facilitate consistent
annotation of similar constructions across languages, while allowing
language-specific extensions when necessary.

The *296* treebanks in v2.15 are annotated according to version 2 of the
UD guidelines and represent the following *168* languages: Abaza,
Abkhaz, Afrikaans, Akkadian, Akuntsu, Albanian, Amharic, Ancient Greek,
Ancient Hebrew, Apurina, Arabic, Armenian, Assyrian, Azerbaijani,
Bambara, Basque, Bavarian, Beja, Belarusian, Bengali, Bhojpuri, Bororo,
Breton, Bulgarian, Buryat, Cantonese, Cappadocian, Catalan, Cebuano,
Chinese, Chukchi, Classical Armenian, Classical Chinese, Coptic,
Croatian, Czech, Danish, Dutch, Egyptian, English, Erzya, Estonian,
Faroese, Finnish, French, Frisian Dutch, Galician, Georgian, German,
Gheg, Gothic, Greek, Guajajara, Guarani, Gujarati, Gwichin, Haitian
Creole, Hausa, Hebrew, Highland Puebla Nahuatl, Hindi, Hittite,
Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese,
Kaapor, Kangri, Karelian, Karo, Kazakh, Khunsari, Kiche, Komi Permyak,
Komi Zyrian, Korean, Kurmanji, Kyrgyz, Latgalian, Latin, Latvian,
Ligurian, Lithuanian, Livvi, Low Saxon, Luxembourgish, Macedonian, Madi,
Maghrebi Arabic French, Makurap, Malayalam, Maltese, Manx, Marathi, Mbya
Guarani, Middle French, Moksha, Munduruku, Naija, Nayini, Neapolitan,
Nheengatu, North Sami, Northwest Gbaya, Norwegian, Old Church Slavonic,
Old East Slavic, Old French, Old Irish, Old Turkish, Ottoman Turkish,
Pashto, Paumari, Persian, Pesh, Phrygian, Polish, Pomak, Portuguese,
Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sinhala, Skolt
Sami, Slovak, Slovenian, Soi, South Levantine Arabic, Spanish, Spanish
Sign Language, Swedish, Swedish Sign Language, Swiss German, Tagalog,
Tamil, Tatar, Teko, Telugu, Telugu English, Thai, Tswana, Tupinamba,
Turkish, Turkish German, Ukrainian, Umbrian, Upper Sorbian, Urdu,
Uyghur, Uzbek, Veps, Vietnamese, Warlpiri, Welsh, Western Armenian,
Western Sierra Puebla Nahuatl, Wolof, Xavante, Xibe, Yakut, Yoruba,
Yupik and Zaar. The 168 languages belong to *33* families: Afro-Asiatic,
Arawakan, Arawan, Austro-Asiatic, Austronesian, Basque, Bororoan,
Chibchan, Chukotko-Kamchatkan, Code switching, Creole, Dravidian,
Eskimo-Aleut, Indo-European, Japanese, Kartvelian, Korean, Macro-Je,
Mande, Mayan, Mongolic, Na-Dene, Niger-Congo, Northwest Caucasian,
Pama-Nyungan, Sign Language, Sino-Tibetan, Tai-Kadai, Tungusic, Tupian,
Turkic, Uralic and Uto-Aztecan. Depending on the language, the treebanks
range in size from fewer than 1,000 tokens to over 3 million tokens. We
expect the next release to be available in May 2025.

The sizes of the following 24 treebanks have changed significantly since
the last release:

Abkhaz AbNC               :  2444 →   6363
Albanian STAF             :     0 →   3563
Beja Autogramm            :     0 →  11951
Beja NSC                  :  5888 →      0
Cappadocian AMGiC         :     0 →    451
Egyptian UJaen            :  5515 →  14650
Georgian GLC              :  2335 →  60173
Gwichin TueCL             :     0 →   1008
Hebrew IAHLTknesset       :     0 →  67007
Italian Old               : 82644 → 122038
Korean KSL                :     0 →  66989
Kyrgyz KTMU               :  7451 →  23654
Nheengatu CompLin         : 15036 →  19278
Northwest Gbaya Autogramm :     0 →   2417
Old East Slavic RNC       : 95551 → 168064
Old East Slavic Ruthenian : 96803 → 111503
Pashto Sikaram            :     0 →    995
Pesh ChibErgIS            :     0 →   2508
Phrygian KUL              :     0 →   1687
Portuguese DANTEStocks    :     0 →  80997
Slovenian SST             : 76341 →  98393
Spanish Sign Language LSE :     0 →   1393
Ukrainian ParlaMint       :     0 →  51997
Uzbek UT                  :     0 →   5850
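
For readers who want to process the size-change list programmatically, the
following is a minimal Python sketch, not part of the UD tooling. It parses
rows of the form "name : old → new" and labels each change. The labels
("added", "removed", "grown", "shrunk", "unchanged") and the function names
are assumptions made for illustration; in particular, reading "0 → n" as a
newly added treebank and "n → 0" as a removed one is an interpretation, and
the announcement does not state the unit of the sizes.

```python
import re

# A few rows copied from the size-change list in this announcement.
ROWS = """\
Abkhaz AbNC : 2444 → 6363
Albanian STAF : 0 → 3563
Beja Autogramm : 0 → 11951
Beja NSC : 5888 → 0
Georgian GLC : 2335 → 60173
"""

# name, optional spaces, colon, old size, arrow, new size
ROW_RE = re.compile(r"^(?P<name>.+?)\s*:\s*(?P<old>\d+)\s*→\s*(?P<new>\d+)\s*$")


def classify(old, new):
    """Label a size change (labels are illustrative assumptions)."""
    if old == 0 and new > 0:
        return "added"
    if old > 0 and new == 0:
        return "removed"
    if new > old:
        return "grown"
    if new < old:
        return "shrunk"
    return "unchanged"


def parse_changes(text):
    """Return (name, old, new, label) tuples for every well-formed row."""
    changes = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        old, new = int(match["old"]), int(match["new"])
        changes.append((match["name"], old, new, classify(old, new)))
    return changes


for name, old, new, label in parse_changes(ROWS):
    print(f"{name}: {old} -> {new} ({label})")
```

The regular expression tolerates both the compact and the column-aligned
layout of the list, since it allows any amount of whitespace around the
colon and the arrow.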

In total, the new release contains *1,939,085* sentences, 32,078,118
surface tokens and *32,741,781* syntactic words.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, Elia Ackermann, Noëmi
Aepli, Hamid Aghaei, Željko Agić, Amir Ahmadi, Lars Ahrenberg, Chika
Kennedy Ajede, Arofat Akhundjanova, Furkan Akkurt, Gabrielė
Aleksandravičiūtė, Ika Alfina, Avner Algom, Khalid Alnajjar, Chiara
Alzetta, Erik Andersen, Matthew Andrews, Lene Antonsen, Tatsuya Aoyama,
Katya Aplonova, Angelina Aquino, Carolina Aragon, Glyd Aranes, Maria
Jesus Aranzabe, Bilge Nas Arıcan, Þórunn Arnardóttir, Gashaw Arutie,
Jessica Naraiswari Arwidarasti, Masayuki Asahara, Katla Ásgeirsdóttir,
Deniz Baran Aslan, Cengiz Asmazoğlu, Luma Ateyah, Furkan Atmaca,
Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Mariana Avelãs,
Elena Badmaeva, Keerthana Balasubramani, Miguel Ballesteros, Esha
Banerjee, Sebastian Bank, Bryan Khelven da Silva Barbosa, Verginica
Barbu Mititelu, Starkaður Barkarson, Rodolfo Basile, Victoria Basmov,
Colin Batchelor, John Bauer, Seyyit Talha Bedir, Shabnam Behzad, Juan
Belieni, Kepa Bengoetxea, İbrahim Benli, Yifat Ben Moshe, Ansu Berg,
Gözde Berk, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė
Bielinskienė, Esma Fatıma Bilgin Taşdemir, Kristín Bjarnadóttir, Verena
Blaschke, Rogier Blokland, Nina Böbel, Victoria Bobicev, Loïc Boizou,
Johnatan Bonilla, Emanuel Borges Völker, Carl Börstell, Cristina Bosco,
Gosse Bouma, Sam Bowman, Adriane Boyd, Anouck Braggaar, António Branco,
Kristina Brokaitė, Aljoscha Burchardt, Carmen Cabeza, Natalia Cáceres
Arandia, Marisa Campos, Marie Candito, Bernard Caron, Gauthier Caron,
Catarina Carvalheiro, Rita Carvalho, Lauren Cassidy, Maria Clara Castro,
Sérgio Castro, Tatiana Cavalcanti, Gülşen Cebiroğlu Eryiğit, Flavio
Massimiliano Cecchini, Giuseppe G. A. Celano, Anila Çepani, Slavomír
Čéplö, Neslihan Cesur, Savas Cetin, Özlem Çetinoğlu, Fabricio Chalub,
Liyanage Chamila, Claudine Chamoreau, Shweta Chauhan, Yifei Chen, Ethan
Chi, Taishi Chika, Yongseok Cho, Jinho Choi, Bermet Chontaeva, Jayeol
Chun, Juyeon Chung, Alessandra T. Cignarella, Silvie Cinková, Aurélie
Collomb, Çağrı Çöltekin, Miriam Connor, Claudia Corbetta, Daniela
Corbetta, Francisco Costa, Marine Courtin, Benoît Crabbé, Mihaela
Cristescu, Vladimir Cvetkoski, Netanel Dahan, Ingerid Løyning Dale,
Philemon Daniel, Elizabeth Davidson, Leonel Figueiredo de Alencar,
Mathieu Dehouck, Martina de Laurentiis, Marie-Catherine de Marneffe,
Valeria de Paiva, Mehmet Oguz Derin, Elvis de Souza, Arantza Diaz de
Ilarraza, Roberto Antonio Díaz Hernández, Carly Dickerson, Ariani Di
Felippo, Arawinda Dinakaramani, Elisa Di Nuovo, Bamba Dione, Peter
Dirix, Hoa Do, Kaja Dobrovoljc, Caroline Döhmer, Adrian Doyle, Timothy
Dozat, Kira Droganova, Magali Sanches Duran, Puneet Dwivedi, Christian
Ebert, Hanne Eckhoff, Masaki Eguchi, Sandra Eiche, Roald Eiselen,
Marhaba Eli, Ali Elkahky, Binyam Ephrem, Olga Erina, Tomaž Erjavec,
Soudabeh Eslami, Farah Essaidi, Aline Etienne, Wograine Evelyn, Sidney
Facundes, Richárd Farkas, Ján Faryad, Federica Favero, Jannatul
Ferdaousi, Marília Fernanda, Hector Fernandez Alcalde, Amal Fethi,
Jennifer Foster, Theodorus Fransen, Cláudia Freitas, Kazunori Fujita,
Katarína Gajdošová, Daniel Galbraith, Edith Galy, Federica Gamba, Marcos
Garcia, José María García-Miguel, Moa Gärdenfors, Tanja Gaustad, Efe
Eren Genç, Fabrício Ferraz Gerardi, Kim Gerdes, Luke Gessler, Filip
Ginter, Gustavo Godoy, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak,
Yoav Goldberg, Gili Goldin, Xavier Gómez Guinovart, Berta González
Saavedra, Bernadeta Griciūtė, Matias Grioni, Loïc Grobol, Normunds
Grūzītis, Bruno Guillaume, Kirian Guiller, Céline Guillot-Barbance,
Tunga Güngör, Vladimir Gurevich, Nizar Habash, Hinrik Hafsteinsson, Jan
Hajič, Jan Hajič jr., Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Muhammad
Yudistira Hanifmuti, Takahiro Harada, Sam Hardwick, Kim Harris, Naïma
Hassert, Dag Haug, Johannes Heinecke, Oliver Hellwig, Felix Hennig,
Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Diana Hoefels,
Petter Hohle, Nick Howell, Yidi Huang, Marivel Huerta Mendez, Jena
Hwang, Takumi Ikeda, Inessa Iliadou, Anton Karl Ingason, Radu Ion, Elena
Irimia, Ọlájídé Ishola, Artan Islamaj, Kaoru Ito, Federica Iurescia,
Sandra Jagodzińska, Siratun Jannat, Tomáš Jelínek, Apoorva Jha,
Katharine Jiang, Mayank Jobanputra, Anders Johannsen, Hildur Jónsdóttir,
Fredrik Jørgensen, Markus Juutinen, Hüner Kaşıkara, Nadezhda Kabaeva,
Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Neslihan Kara, Ritván
Karahóǧa, Andre Kåsen, Tolga Kayadelen, Sarveswaran Kengatharaiyer,
Václava Kettnerová, Lilit Kharatyan, Jesse Kirchner, Elena Klementieva,
Elena Klyachko, Petr Kocharov, Arne Köhn, Abdullatif Köksal, Kamil
Kopacewicz, Timo Korkiakangas, Mehmet Köse, Alexey Koshevoy, Nelda Kote,
Natalia Kotsyba, Barbara Kovačić, Jolanta Kovalevskaitė, Emmanuelle
Kowner, Simon Krek, Parameswari Krishnamurthy, Sandra Kübler, Adrian
Kuqi, Oğuzhan Kuyrukçu, Aslı Kuzgun, Sookyoung Kwak, Kris Kyle, Käbi
Laan, Veronika Laippala, Lorenzo Lambertino, Israel Landau, Tatiana
Lando, Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phương Lê
Hồng, Alessandro Lenci, Saran Lertpradit, Herman Leung, Maria Levina,
Lauren Levine, Cheuk Ying Li, Josie Li, Keying Li, Yixuan Li, Yuan Li,
KyungTae Lim, Bruna Lima Padovani, Yi-Ju Jessica Lin, Krister Lindén,
Yang Janet Liu, Nikola Ljubešić, Irina Lobzhanidze, Olga Loginova,
Lucelene Lopes, Edita Luftiu, Arsenii Lukashevskyi, Stefano Lusito,
Anne-Marie Lutgen, Andry Luthfi, Mikko Luukko, Olga Lyashevskaya, Teresa
Lynn, Vivien Macketanz, Menel Mahamdi, Jean Maillard, Ilya Makarchuk,
Aibek Makazhanov, Francesco Mambrini, Michael Mandl, Christopher
Manning, Ruli Manurung, Büşra Marşan, Cătălina Mărănduc, David Mareček,
Katrin Marheinecke, Stella Markantonatou, Héctor Martínez Alonso, Lorena
Martín Rodríguez, André Martins, Cláudia Martins, Jan Mašek, Hiroshi
Matsuda, Yuji Matsumoto, Alessandro Mazzei, Ryan McDonald, Sarah
McGuinness, Maitrey Mehta, Pierre André Ménard, Gustavo Mendonça, Hilla
Merhav, Tatiana Merzhevich, Paul Meurer, Niko Miekka, Emilia Milano,
Aaron Miller, Yael Minerbi, Karina Mischenkova, Anna Missilä, Cătălin
Mititelu, Maria Mitrofan, Yusuke Miyao, AmirHossein Mojiri Foroushani,
Judit Molnár, Amirsaeid Moloodi, Simonetta Montemagni, Amir More, Laura
Moreno Romero, Giovanni Moretti, Shinsuke Mori, Tomohiko Morioka,
Shigeki Moro, Bjartur Mortensen, Bohdan Moskalevskyi, Kadri Muischnek,
Robert Munro, Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani, Mariam
Nakhlé, Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta
Nešpore-Bērzkalne, Manuela Nevaci, Lương Nguyễn Thị, Huyền Nguyễn Thị
Minh, Yoshihiro Nikaido, Vitaly Nikolaev, Rattima Nitisaroj, Victor
Norrman, Alireza Nourian, Maria das Graças Volpe Nunes, Hanna Nurmi,
Stina Ojala, Atul Kr. Ojha, Hulda Óladóttir, Adédayọ̀ Olúòkun, Mai
Omura, Emeka Onwuegbuzia, Noam Ordan, Petya Osenova, Robert Östling,
Annika Ott, Lilja Øvrelid, Şaziye Betül Özateş, Merve Özçelik, Arzucan
Özgür, Balkız Öztürk Başaran, Teresa Paccosi, Alessio Palmero Aprosio,
Anastasia Panova, Thiago Alexandre Salgueiro Pardo, Hyunji Hayley Park,
Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka Patejuk,
Guilherme Paulino-Passos, Giulia Pedonese, Oggi Peeters, Angelika
Peljak-Łapińska, Siyao Peng, Siyao Logan Peng, Rita Pereira, Sílvia
Pereira, Cenel-Augusto Perez, Natalia Perkova, Guy Perrier, Slav Petrov,
Daria Petrova, Andrea Peverelli, Jason Phelan, Claudel Pierre-Louis,
Jussi Piitulainen, Yuval Pinter, Clara Pinto, Rodrigo Pintucci, Tommi A
Pirinen, Emily Pitler, Magdalena Plamada, Barbara Plank, Alistair Plum,
Thierry Poibeau, Larisa Ponomareva, Martin Popel, Lauma Pretkalniņa,
Rigardt Pretorius, Sophie Prévost, Prokopis Prokopidis, Adam
Przepiórkowski, Robert Pugh, Tiina Puolakainen, Christoph Purschke,
Sampo Pyysalo, Peng Qi, Andreia Querido, Andriela Rääbis, Ella
Rabinovich, Alexandre Rademaker, Mizanur Rahoman, Taraka Rama,
Loganathan Ramasamy, Carlos Ramisch, Joana Ramos, Fam Rashel, Mohammad
Sadegh Rasooli, Vinit Ravishankar, Livy Real, Petru Rebeja, Siva Reddy,
Mathilde Regnault, Georg Rehm, Arij Riabi, Ivan Riabov, Michael Rießler,
Erika Rimkutė, Larissa Rinaldi, Laura Rituma, Putri Rizqiyah, Luisa
Rocha, Eiríkur Rögnvaldsson, Ivan Roksandic, Norton Trevisan Roman,
Mykhailo Romanenko, Rudolf Rosa, Valentin Roșca, Paulette Roulon, Davide
Rovati, Ben Rozonoyer, Olga Rudina, Jack Rueter, Paolo Ruffolo, Kristján
Rúnarsson, Rozana Rushiti, Shoval Sadde, Pegah Safari, Aleksi Sahala,
Shadi Saleh, Alessio Salomoni, Tanja Samardžić, Konstantinos Sampanis,
Stephanie Samson, Xulia Sánchez-Rodríguez, Manuela Sanguinetti, Ezgi
Sanıyar, Dage Särg, Marta Sartor, Albina Sarymsakova, Mitsuya Sasaki,
Baiba Saulīte, Agata Savary, Yanin Sawanakunanon, Shefali Saxena, Kevin
Scannell, Salvatore Scarlata, Emmanuel Schang, Nathan Schneider,
Sebastian Schuster, Lane Schwartz, Djamé Seddah, Wolfgang Seeker, Sven
Sellmer, Mojgan Seraji, Syeda Shahzadi, Mo Shen, Atsuko Shimada, Gyu-Ho
Shin, Hiroyuki Shirasu, Yana Shishkina, Muh Shohibussirri, Maria
Shvedova, Janine Siewert, Einar Freyr Sigurðsson, João Silva, Aline
Silveira, Natalia Silveira, Sara Silveira, Maria Simi, Radu Simionescu,
Katalin Simkó, Mária Šimková, Haukur Barri Símonarson, Kiril Simov,
Dmitri Sitchinava, Ted Sither, Aaron Smith, Isabela Soares-Bastos, Per
Erik Solberg, Barbara Sonnenhauser, Shafi Sourov, Rachele Sprugnoli,
Vivian Stamou, Steinþór Steingrímsson, Antonio Stella, Abishek Stephen,
Milan Straka, Omer Strass, Emmett Strickland, Jana Strnadová, Alane
Suhr, Yogi Lesmana Sulestio, Umut Sulubacak, Hakyung Sung, Shingo
Suzuki, Daniel Swanson, Zsolt Szántó, Chihiro Taguchi, Dima Taji, Luigi
Talamo, Fabio Tamburini, Mary Ann C. Tan, Takaaki Tanaka, Dipta Tanaya,
Mirko Tavoni, Samson Tella, Isabelle Tellier, Marinella Testori,
Guillaume Thomas, Tarık Emre Tıraş, Sara Tonelli, Liisi Torga, Marsida
Toska, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Utku Türk, Francis
Tyers, Sveinbjörn Þórðarson, Vilhjálmur Þorsteinsson, Sumire Uematsu,
Roman Untilov, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit, Andrius
Utka, Elena Vagnoni, Sowmya Vajjala, Socrates Vak, Rob van der Goot,
Martine Vanhove, Daniel van Niekerk, Gertjan van Noord, Viktor Varga,
Uliana Vedenina, Giulia Venturi, Eric Villemonte de la Clergerie,
Veronika Vincze, Anishka Vissamsetty, Natalia Vlasova, Eleni
Vligouridou, Aya Wakasa, Joel C. Wallenberg, Lars Wallin, Abigail Walsh,
John Wang, Jonathan North Washington, Leonie Weissweiler, Maximilan
Wendt, Paul Widmer, Shira Wigderson, Sri Hartati Wijono, Vanessa
Berwanger Wille, Seyi Williams, Miriam Winkler, Shuly Wintner, Mats
Wirén, Christian Wittern, Tsegay Woldemariam, Tak-sum Wong, Alina
Wróblewska, Qishen Wu, Mary Yako, Kayo Yamashita, Naoki Yamazaki,
Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Arife Betül Yenice,
Enes Yılandiloğlu, Olcay Taner Yıldız, Zhuoran Yu, Arlisa Yuliawati,
Zdeněk Žabokrtský, Shorouq Zahra, Amir Zeldes, He Zhou, Hanzhi Zhu,
Yilun Zhu, Anna Zhuravleva, Rayan Ziane, Artūrs Znotiņš

References

Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, and Daniel
Zeman. 2021. Universal Dependencies. In Computational Linguistics 47:2,
pp. 255–308.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič,
Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis
Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An Evergrowing
Multilingual Treebank Collection. In Proceedings of LREC.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D.
Manning. 2006. Generating typed dependency parses from phrase structure
parses. In Proceedings of LREC.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The
Stanford typed dependencies representation. In COLING Workshop on
Cross-framework and Cross-domain Parser Evaluation.

Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri
Haverinen, Filip Ginter, Joakim Nivre, and Christopher Manning. 2014.
Universal Stanford Dependencies: A cross-linguistic typology. In
Proceedings of LREC.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg,
Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo
Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal
Dependencies v1: A Multilingual Treebank Collection. In Proceedings of LREC.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal
part-of-speech tagset. In Proceedings of LREC.

Daniel Zeman. 2008. Reusable Tagset Conversion Using Tagset Drivers. In
Proceedings of LREC.