We are very happy to announce the seventeenth release of annotated
treebanks in Universal Dependencies, v2.11, available at
http://universaldependencies.org/.
Universal Dependencies is a project that seeks to develop
cross-linguistically consistent treebank annotation for many languages
with the goal of facilitating multilingual parser development,
cross-lingual learning, and parsing research from a language typology
perspective (de Marneffe et al., 2021; Nivre et al., 2020). The
annotation scheme is based on (universal) Stanford dependencies (de
Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags
(Petrov et al., 2012), and the Interset interlingua for morphosyntactic
tagsets (Zeman, 2008). The general philosophy is to provide a universal
inventory of categories and guidelines to facilitate consistent
annotation of similar constructions across languages, while allowing
language-specific extensions when necessary.
The *243* treebanks in v2.11 are annotated according to version 2 of the
UD guidelines and represent the following *138 languages:* Abaza,
Afrikaans, Akkadian, Akuntsu, Albanian, Amharic, Ancient Greek, Ancient
Hebrew, Apurina, Arabic, Armenian, Assyrian, Bambara, Basque, Beja,
Belarusian, Bengali, Bhojpuri, Breton, Bulgarian, Buryat, Cantonese,
Catalan, Cebuano, Chinese, Chukchi, Classical Chinese, Coptic, Croatian,
Czech, Danish, Dutch, English, Erzya, Estonian, Faroese, Finnish,
French, Frisian Dutch, Galician, German, Gheg, Gothic, Greek, Guajajara,
Guarani, Hebrew, Hindi, Hindi English, Hittite, Hungarian, Icelandic,
Indonesian, Irish, Italian, Japanese, Javanese, Kaapor, Kangri,
Karelian, Karo, Kazakh, Khunsari, Kiche, Komi Permyak, Komi Zyrian,
Korean, Kurmanji, Latin, Latvian, Ligurian, Lithuanian, Livvi, Low
Saxon, Madi, Makurap, Malayalam, Maltese, Manx, Marathi, Mbya Guarani,
Moksha, Munduruku, Naija, Nayini, Neapolitan, Nheengatu, North Sami,
Norwegian, Old Church Slavonic, Old East Slavic, Old French, Old
Turkish, Persian, Polish, Pomak, Portuguese, Romanian, Russian,
Sanskrit, Scottish Gaelic, Serbian, Sinhala, Skolt Sami, Slovak,
Slovenian, Soi, South Levantine Arabic, Spanish, Swedish, Swedish Sign
Language, Swiss German, Tagalog, Tamil, Tatar, Teko, Telugu, Thai,
Tupinamba, Turkish, Turkish German, Ukrainian, Umbrian, Upper Sorbian,
Urdu, Uyghur, Vietnamese, Warlpiri, Welsh, Western Armenian, Western
Sierra Puebla Nahuatl, Wolof, Xavante, Xibe, Yakut, Yoruba, Yupik and
Zaar. The 138 languages belong to *29 families:* Afro-Asiatic, Arawakan,
Arawan, Austro-Asiatic, Austronesian, Basque, Chukotko-Kamchatkan, Code
switching, Creole, Dravidian, Eskimo-Aleut, Indo-European, Japanese,
Korean, Macro-Je, Mande, Mayan, Mongolic, Niger-Congo, Northwest
Caucasian, Pama-Nyungan, Sign Language, Sino-Tibetan, Tai-Kadai,
Tungusic, Tupian, Turkic, Uralic and Uto-Aztecan. Depending on the
language, the treebanks range in size from fewer than 1,000 tokens to
over 3 million tokens. We expect the next release to be available in May
2023.
The sizes of the following 28 treebanks have changed significantly since
the last release:
Abaza ATB : 0 → 652
Akuntsu TuDeT : 1074 → 1324
Apurina UFPA : 776 → 865
Chinese PatentChar : 0 → 2160
Erzya JR : 17412 → 20541
Estonian EWT : 78331 → 90694
French ParisStories : 30004 → 42865
Gheg GPS : 0 → 15990
Icelandic GC : 0 → 99611
Icelandic Modern : 158150 → 80395
Irish Cadhan : 0 → 3804
Italian ParlaMint : 0 → 20460
Low Saxon LSDC : 2547 → 2935
Makurap TuDeT : 146 → 178
Malayalam UFAL : 0 → 202
Nheengatu CompLin : 0 → 2146
Old East Slavic RNC : 35606 → 48647
Old East Slavic Ruthenian : 0 → 3069
Portuguese CINTIL : 0 → 475860
Portuguese PetroGold : 0 → 250605
Sinhala STB : 0 → 880
Tatar NMCTT : 1458 → 2280
Teko TuDeT : 242 → 1375
Umbrian IKUVINA : 602 → 786
Western Sierra Puebla Nahuatl ITML : 0 → 10120
Xavante XDT : 0 → 120
Yakut YKTDT : 496 → 1403
Zaar Autogramm : 0 → 7625
Daniel Zeman, Joakim Nivre, Mitchell Abrams, Elia Ackermann, Noëmi
Aepli, Hamid Aghaei, Željko Agić, Amir Ahmadi, Lars Ahrenberg, Chika
Kennedy Ajede, Salih Furkan Akkurt, Gabrielė Aleksandravičiūtė, Ika
Alfina, Avner Algom, Chiara Alzetta, Erik Andersen, Lene Antonsen, Katya
Aplonova, Angelina Aquino, Carolina Aragon, Glyd Aranes, Maria Jesus
Aranzabe, Bilge Nas Arıcan, Þórunn Arnardóttir, Gashaw Arutie, Jessica
Naraiswari Arwidarasti, Masayuki Asahara, Katla Ásgeirsdóttir, Deniz
Baran Aslan, Cengiz Asmazoğlu, Luma Ateyah, Furkan Atmaca, Mohammed
Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Keerthana
Balasubramani, Miguel Ballesteros, Esha Banerjee, Sebastian Bank,
Verginica Barbu Mititelu, Starkaður Barkarson, Rodolfo Basile, Victoria
Basmov, Colin Batchelor, John Bauer, Seyyit Talha Bedir, Juan Belieni,
Kepa Bengoetxea, Yifat Ben Moshe, Gözde Berk, Yevgeni Berzak, Irshad
Ahmad Bhat, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė
Bielinskienė, Kristín Bjarnadóttir, Rogier Blokland, Victoria Bobicev,
Loïc Boizou, Emanuel Borges Völker, Carl Börstell, Cristina Bosco, Gosse
Bouma, Sam Bowman, Adriane Boyd, Anouck Braggaar, Kristina Brokaitė,
Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Lauren
Cassidy, Maria Clara Castro, Tatiana Cavalcanti, Gülşen Cebiroğlu
Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír
Čéplö, Neslihan Cesur, Savas Cetin, Özlem Çetinoğlu, Fabricio Chalub,
Liyanage Chamila, Shweta Chauhan, Ethan Chi, Taishi Chika, Yongseok Cho,
Jinho Choi, Jayeol Chun, Juyeon Chung, Alessandra T. Cignarella, Silvie
Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Daniela
Corbetta, Marine Courtin, Mihaela Cristescu, Philemon Daniel, Elizabeth
Davidson, Leonel Figueiredo de Alencar, Mathieu Dehouck, Martina de
Laurentiis, Marie-Catherine de Marneffe, Valeria de Paiva, Mehmet Oguz
Derin, Elvis de Souza, Arantza Diaz de Ilarraza, Carly Dickerson,
Arawinda Dinakaramani, Elisa Di Nuovo, Bamba Dione, Peter Dirix, Kaja
Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Christian
Ebert, Hanne Eckhoff, Sandra Eiche, Marhaba Eli, Ali Elkahky, Binyam
Ephrem, Olga Erina, Tomaž Erjavec, Aline Etienne, Wograine Evelyn,
Sidney Facundes, Richárd Farkas, Federica Favero, Jannatul Ferdaousi,
Marília Fernanda, Hector Fernandez Alcalde, Jennifer Foster, Cláudia
Freitas, Kazunori Fujita, Katarína Gajdošová, Daniel Galbraith, Federica
Gamba, Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Fabrício Ferraz
Gerardi, Kim Gerdes, Filip Ginter, Gustavo Godoy, Iakes Goenaga, Koldo
Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta
González Saavedra, Bernadeta Griciūtė, Matias Grioni, Loïc Grobol,
Normunds Grūzītis, Bruno Guillaume, Céline Guillot-Barbance, Tunga
Güngör, Nizar Habash, Hinrik Hafsteinsson, Jan Hajič, Jan Hajič jr.,
Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Muhammad Yudistira Hanifmuti,
Takahiro Harada, Sam Hardwick, Kim Harris, Dag Haug, Johannes Heinecke,
Oliver Hellwig, Felix Hennig, Barbora Hladká, Jaroslava Hlaváčová,
Florinel Hociung, Petter Hohle, Marivel Huerta Mendez, Jena Hwang,
Takumi Ikeda, Anton Karl Ingason, Radu Ion, Elena Irimia, Ọlájídé
Ishola, Artan Islamaj, Kaoru Ito, Siratun Jannat, Tomáš Jelínek, Apoorva
Jha, Katharine Jiang, Anders Johannsen, Hildur Jónsdóttir, Fredrik
Jørgensen, Markus Juutinen, Hüner Kaşıkara, Andre Kaasen, Nadezhda
Kabaeva, Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Neslihan Kara,
Ritván Karahóǧa, Boris Katz, Tolga Kayadelen, Sarveswaran
Kengatharaiyer, Jessica Kenney, Václava Kettnerová, Jesse Kirchner,
Elena Klementieva, Elena Klyachko, Arne Köhn, Abdullatif Köksal, Kamil
Kopacewicz, Timo Korkiakangas, Mehmet Köse, Alexey Koshevoy, Natalia
Kotsyba, Jolanta Kovalevskaitė, Simon Krek, Parameswari Krishnamurthy,
Sandra Kübler, Adrian Kuqi, Oğuzhan Kuyrukçu, Aslı Kuzgun, Sookyoung
Kwak, Veronika Laippala, Lucia Lam, Lorenzo Lambertino, Tatiana Lando,
Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phương Lê Hồng,
Alessandro Lenci, Saran Lertpradit, Herman Leung, Maria Levina, Cheuk
Ying Li, Josie Li, Keying Li, Yixuan Li, Yuan Li, KyungTae Lim, Bruna
Lima Padovani, Krister Lindén, Nikola Ljubešić, Olga Loginova, Stefano
Lusito, Andry Luthfi, Mikko Luukko, Olga Lyashevskaya, Teresa Lynn,
Vivien Macketanz, Menel Mahamdi, Jean Maillard, Ilya Makarchuk, Aibek
Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, Büşra
Marşan, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Stella
Markantonatou, Héctor Martínez Alonso, Lorena Martín Rodríguez, André
Martins, Jan Mašek, Hiroshi Matsuda, Yuji Matsumoto, Alessandro Mazzei,
Ryan McDonald, Sarah McGuinness, Gustavo Mendonça, Tatiana Merzhevich,
Niko Miekka, Karina Mischenkova, Margarita Misirpashayeva, Anna Missilä,
Cătălin Mititelu, Maria Mitrofan, Yusuke Miyao, AmirHossein Mojiri
Foroushani, Judit Molnár, Amirsaeid Moloodi, Simonetta Montemagni, Amir
More, Laura Moreno Romero, Giovanni Moretti, Keiko Sophie Mori, Shinsuke
Mori, Tomohiko Morioka, Shigeki Moro, Bjartur Mortensen, Bohdan
Moskalevskyi, Kadri Muischnek, Robert Munro, Yugo Murawaki, Kaili
Müürisep, Pinkey Nainwani, Mariam Nakhlé, Juan Ignacio Navarro
Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Manuela Nevaci,
Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Yoshihiro Nikaido, Vitaly
Nikolaev, Rattima Nitisaroj, Alireza Nourian, Hanna Nurmi, Stina Ojala,
Atul Kr. Ojha, Hulda Óladóttir, Adédayọ̀ Olúòkun, Mai Omura, Emeka
Onwuegbuzia, Noam Ordan, Petya Osenova, Robert Östling, Lilja Øvrelid,
Şaziye Betül Özateş, Betül Özateş, Merve Özçelik, Arzucan Özgür, Balkız
Öztürk Başaran, Teresa Paccosi, Alessio Palmero Aprosio, Anastasia
Panova, Hyunji Hayley Park, Niko Partanen, Elena Pascual, Marco
Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Giulia
Pedonese, Angelika Peljak-Łapińska, Siyao Peng, Cenel-Augusto Perez,
Natalia Perkova, Guy Perrier, Slav Petrov, Daria Petrova, Andrea
Peverelli, Jason Phelan, Jussi Piitulainen, Rodrigo Pintucci, Tommi A
Pirinen, Emily Pitler, Magdalena Plamada, Barbara Plank, Thierry
Poibeau, Larisa Ponomareva, Martin Popel, Lauma Pretkalniņa, Sophie
Prévost, Prokopis Prokopidis, Adam Przepiórkowski, Robert Pugh, Tiina
Puolakainen, Sampo Pyysalo, Peng Qi, Andriela Rääbis, Alexandre
Rademaker, Mizanur Rahoman, Taraka Rama, Loganathan Ramasamy, Carlos
Ramisch, Fam Rashel, Mohammad Sadegh Rasooli, Vinit Ravishankar, Livy
Real, Petru Rebeja, Siva Reddy, Mathilde Regnault, Georg Rehm, Ivan
Riabov, Michael Rießler, Erika Rimkutė, Larissa Rinaldi, Laura Rituma,
Putri Rizqiyah, Luisa Rocha, Eiríkur Rögnvaldsson, Ivan Roksandic,
Mykhailo Romanenko, Rudolf Rosa, Valentin Roșca, Davide Rovati, Ben
Rozonoyer, Olga Rudina, Jack Rueter, Kristján Rúnarsson, Shoval Sadde,
Pegah Safari, Benoît Sagot, Aleksi Sahala, Shadi Saleh, Alessio
Salomoni, Tanja Samardžić, Stephanie Samson, Manuela Sanguinetti, Ezgi
Sanıyar, Dage Särg, Marta Sartor, Mitsuya Sasaki, Baiba Saulīte, Yanin
Sawanakunanon, Shefali Saxena, Kevin Scannell, Salvatore Scarlata,
Nathan Schneider, Sebastian Schuster, Lane Schwartz, Djamé Seddah,
Wolfgang Seeker, Mojgan Seraji, Syeda Shahzadi, Mo Shen, Atsuko Shimada,
Hiroyuki Shirasu, Yana Shishkina, Muh Shohibussirri, Maria Shvedova,
Janine Siewert, Einar Freyr Sigurðsson, João Ricardo Silva, Aline
Silveira, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó,
Mária Šimková, Haukur Barri Símonarson, Kiril Simov, Dmitri Sitchinava,
Maria Skachedubova, Aaron Smith, Isabela Soares-Bastos, Barbara
Sonnenhauser, Shafi Sourov, Carolyn Spadine, Rachele Sprugnoli, Vivian
Stamou, Steinþór Steingrímsson, Antonio Stella, Abishek Stephen, Milan
Straka, Emmett Strickland, Jana Strnadová, Alane Suhr, Yogi Lesmana
Sulestio, Umut Sulubacak, Shingo Suzuki, Daniel Swanson, Zsolt Szántó,
Chihiro Taguchi, Dima Taji, Yuta Takahashi, Fabio Tamburini, Mary Ann C.
Tan, Takaaki Tanaka, Dipta Tanaya, Mirko Tavoni, Samson Tella, Isabelle
Tellier, Marinella Testori, Guillaume Thomas, Sara Tonelli, Liisi Torga,
Marsida Toska, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Utku Türk,
Francis Tyers, Sveinbjörn Þórðarson, Vilhjálmur Þorsteinsson, Sumire
Uematsu, Roman Untilov, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit,
Andrius Utka, Elena Vagnoni, Sowmya Vajjala, Rob van der Goot, Martine
Vanhove, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Uliana
Vedenina, Giulia Venturi, Eric Villemonte de la Clergerie, Veronika
Vincze, Natalia Vlasova, Aya Wakasa, Joel C. Wallenberg, Lars Wallin,
Abigail Walsh, Jing Xian Wang, Jonathan North Washington, Maximilan
Wendt, Paul Widmer, Shira Wigderson, Sri Hartati Wijono, Vanessa
Berwanger Wille, Seyi Williams, Mats Wirén, Christian Wittern, Tsegay
Woldemariam, Tak-sum Wong, Alina Wróblewska, Mary Yako, Kayo Yamashita,
Naoki Yamazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Arife
Betül Yenice, Olcay Taner Yıldız, Zhuoran Yu, Arlisa Yuliawati, Zdeněk
Žabokrtský, Shorouq Zahra, Amir Zeldes, He Zhou, Hanzhi Zhu, Anna
Zhuravleva, Rayan Ziane
References
Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, and
Daniel Zeman. 2021. Universal Dependencies. In Computational Linguistics
47:2, pp. 255–308.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič,
Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis
Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An Evergrowing
Multilingual Treebank Collection. In Proceedings of LREC.
Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D.
Manning. 2006. Generating typed dependency parses from phrase structure
parses. In Proceedings of LREC.
Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The
Stanford typed dependencies representation. In COLING Workshop on
Cross-framework and Cross-domain Parser Evaluation.
Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri
Haverinen, Filip Ginter, Joakim Nivre, and Christopher Manning. 2014.
Universal Stanford Dependencies: A cross-linguistic typology. In
Proceedings of LREC.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg,
Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo
Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016.
Universal Dependencies v1: A Multilingual Treebank Collection. In
Proceedings of LREC.
Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal
part-of-speech tagset. In Proceedings of LREC.
Daniel Zeman. 2008. Reusable Tagset Conversion Using Tagset Drivers. In
Proceedings of LREC.