----- Original Message -----
> We are very happy to announce the twentieth release of annotated
> treebanks in Universal Dependencies, v2.14, available at
> https://universaldependencies.org/.
>
> Universal Dependencies is a project that seeks to develop
> cross-linguistically consistent treebank annotation for many languages
> with the goal of facilitating multilingual parser development,
> cross-lingual learning, and parsing research from a language typology
> perspective (de Marneffe et al., 2021; Nivre et al., 2020). The
> annotation scheme is based on (universal) Stanford dependencies (de
> Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags
> (Petrov et al., 2012), and the Interset interlingua for morphosyntactic
> tagsets (Zeman, 2008). The general philosophy is to provide a universal
> inventory of categories and guidelines to facilitate consistent
> annotation of similar constructions across languages, while allowing
> language-specific extensions when necessary.
>
> The *283* treebanks in v2.14 are annotated according to version 2 of the
> UD guidelines and represent the following *161* languages: Abaza,
> Abkhaz, Afrikaans, Akkadian, Akuntsu, Albanian, Amharic, Ancient Greek,
> Ancient Hebrew, Apurina, Arabic, Armenian, Assyrian, Azerbaijani,
> Bambara, Basque, Bavarian, Beja, Belarusian, Bengali, Bhojpuri, Bororo,
> Breton, Bulgarian, Buryat, Cantonese, Cappadocian, Catalan, Cebuano,
> Chinese, Chukchi, Classical Armenian, Classical Chinese, Coptic,
> Croatian, Czech, Danish, Dutch, Egyptian, English, Erzya, Estonian,
> Faroese, Finnish, French, Frisian Dutch, Galician, Georgian, German,
> Gheg, Gothic, Greek, Guajajara, Guarani, Gujarati, Haitian Creole,
> Hausa, Hebrew, Highland Puebla Nahuatl, Hindi, Hittite, Hungarian,
> Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kaapor,
> Kangri, Karelian, Karo, Kazakh, Khunsari, Kiche, Komi Permyak, Komi
> Zyrian, Korean, Kurmanji, Kyrgyz, Latgalian, Latin, Latvian, Ligurian,
> Lithuanian, Livvi, Low Saxon, Luxembourgish, Macedonian, Madi, Maghrebi
> Arabic French, Makurap, Malayalam, Maltese, Manx, Marathi, Mbya Guarani,
> Middle French, Moksha, Munduruku, Naija, Nayini, Neapolitan, Nheengatu,
> North Sami, Norwegian, Old Church Slavonic, Old East Slavic, Old French,
> Old Irish, Old Turkish, Ottoman Turkish, Paumari, Persian, Polish,
> Pomak, Portuguese, Romanian, Russian, Sanskrit, Scottish Gaelic,
> Serbian, Sinhala, Skolt Sami, Slovak, Slovenian, Soi, South Levantine
> Arabic, Spanish, Swedish, Swedish Sign Language, Swiss German, Tagalog,
> Tamil, Tatar, Teko, Telugu, Telugu English, Thai, Tswana, Tupinamba,
> Turkish, Turkish German, Ukrainian, Umbrian, Upper Sorbian, Urdu,
> Uyghur, Veps, Vietnamese, Warlpiri, Welsh, Western Armenian, Western
> Sierra Puebla Nahuatl, Wolof, Xavante, Xibe, Yakut, Yoruba, Yupik and
> Zaar.
> The 161 languages belong to *31* families: Afro-Asiatic, Arawakan,
> Arawan, Austro-Asiatic, Austronesian, Basque, Bororoan,
> Chukotko-Kamchatkan, Code switching, Creole, Dravidian, Eskimo-Aleut,
> Indo-European, Japanese, Kartvelian, Korean, Macro-Je, Mande, Mayan,
> Mongolic, Niger-Congo, Northwest Caucasian, Pama-Nyungan, Sign Language,
> Sino-Tibetan, Tai-Kadai, Tungusic, Tupian, Turkic, Uralic and
> Uto-Aztecan. Depending on the language, the treebanks range in size from
> less than 1,000 tokens to over 3 million tokens. We expect the next
> release to be available in November 2024.
>
> The size of the following 39 treebanks changed significantly since the
> last release:
> Abkhaz AbNC              :      0 →   2444
> Azerbaijani TueCL        :      0 →    656
> Bavarian MaiBaam         :      0 →  15024
> Beja NSC                 :   1206 →   5888
> Bororo BDT               :   1905 →   6993
> Cappadocian TueCL        :      0 →   4118
> Classical Armenian CAVaL :  13522 →  81996
> Classical Chinese TueCL  :      0 →    648
> Dutch LassySmall         :  98241 → 297486
> Egyptian UJaen           :      0 →   5515
> English CTeTex           :      0 →   9273
> English GUM              : 187522 → 212035
> Galician PUD             :      0 →  23510
> Gujarati GujTB           :      0 →   1885
> Hausa NorthernAutogramm  :      0 →   3919
> Hausa SouthernAutogramm  :      0 →  14585
> Italian Old              :  41367 →  82644
> Kyrgyz TueCL             :      0 →   1001
> Latgalian Cairo          :      0 →    173
> Latin CIRCSE             :      0 →  18968
> Latvian Cairo            :      0 →    171
> Low Saxon LSDC           :   4683 →  22639
> Luxembourgish LuxBank    :      0 →    206
> Nheengatu CompLin        :  12743 →  15036
> Old East Slavic RNC      :  48647 →  95551
> Old Turkish Clausal      :      0 →    158
> Old Turkish Tonqq        :    158 →      0
> Ottoman Turkish BOUN     :      0 →   8814
> Ottoman Turkish DUDU     :      0 →    813
> Paumari TueCL            :      0 →    504
> Pomak Philotis           :  86780 →  34348
> Romanian TueCL           :      0 →   4417
> Sanskrit Vedic           :  27117 → 206440
> Slovenian SST            :  29488 →  76341
> Spanish COSER            :      0 →   8073
> Telugu English TECT      :      0 →    456
> Tswana Popapolelo        :      0 →    214
> Vietnamese TueCL         :      0 →   1888
> Zaar Autogramm           :   7625 →  17682
>
> In total, the new release contains *1,906,050* sentences, 31,541,523
> surface tokens and
> *32,179,731* syntactic words.
>
> Daniel Zeman, Joakim Nivre, Mitchell Abrams, Elia Ackermann, Noëmi
> Aepli, Hamid Aghaei, Željko Agić, Amir Ahmadi, Lars Ahrenberg, Chika
> Kennedy Ajede, Salih Furkan Akkurt, Gabrielė Aleksandravičiūtė, Ika
> Alfina, Avner Algom, Khalid Alnajjar, Chiara Alzetta, Erik Andersen,
> Lene Antonsen, Tatsuya Aoyama, Katya Aplonova, Angelina Aquino, Carolina
> Aragon, Glyd Aranes, Maria Jesus Aranzabe, Bilge Nas Arıcan, Þórunn
> Arnardóttir, Gashaw Arutie, Jessica Naraiswari Arwidarasti, Masayuki
> Asahara, Katla Ásgeirsdóttir, Deniz Baran Aslan, Cengiz Asmazoğlu, Luma
> Ateyah, Furkan Atmaca, Mohammed Attia, Aitziber Atutxa, Liesbeth
> Augustinus, Mariana Avelãs, Elena Badmaeva, Keerthana Balasubramani,
> Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu
> Mititelu, Starkaður Barkarson, Rodolfo Basile, Victoria Basmov, Colin
> Batchelor, John Bauer, Seyyit Talha Bedir, Shabnam Behzad, Juan Belieni,
> Kepa Bengoetxea, İbrahim Benli, Yifat Ben Moshe, Ansu Berg, Gözde Berk,
> Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė Bielinskienė, Esma
> Fatıma Bilgin Taşdemir, Kristín Bjarnadóttir, Verena Blaschke, Rogier
> Blokland, Victoria Bobicev, Loïc Boizou, Johnatan Bonilla, Emanuel
> Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman,
> Adriane Boyd, Anouck Braggaar, António Branco, Kristina Brokaitė,
> Aljoscha Burchardt, Marisa Campos, Marie Candito, Bernard Caron,
> Gauthier Caron, Catarina Carvalheiro, Rita Carvalho, Lauren Cassidy,
> Maria Clara Castro, Sérgio Castro, Tatiana Cavalcanti, Gülşen Cebiroğlu
> Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír
> Čéplö, Neslihan Cesur, Savas Cetin, Özlem Çetinoğlu, Fabricio Chalub,
> Liyanage Chamila, Shweta Chauhan, Yifei Chen, Ethan Chi, Taishi Chika,
> Yongseok Cho, Jinho Choi, Bermet Chontaeva, Jayeol Chun, Juyeon Chung,
> Alessandra T.
> Cignarella, Silvie Cinková, Aurélie Collomb, Çağrı
> Çöltekin, Miriam Connor, Claudia Corbetta, Daniela Corbetta, Francisco
> Costa, Marine Courtin, Benoît Crabbé, Mihaela Cristescu, Vladimir
> Cvetkoski, Ingerid Løyning Dale, Philemon Daniel, Elizabeth Davidson,
> Leonel Figueiredo de Alencar, Mathieu Dehouck, Martina de Laurentiis,
> Marie-Catherine de Marneffe, Valeria de Paiva, Mehmet Oguz Derin, Elvis
> de Souza, Arantza Diaz de Ilarraza, Roberto Antonio Díaz Hernández,
> Carly Dickerson, Arawinda Dinakaramani, Elisa Di Nuovo, Bamba Dione,
> Peter Dirix, Hoa Do, Kaja Dobrovoljc, Caroline Döhmer, Adrian Doyle,
> Timothy Dozat, Kira Droganova, Magali Sanches Duran, Puneet Dwivedi,
> Christian Ebert, Hanne Eckhoff, Masaki Eguchi, Sandra Eiche, Roald
> Eiselen, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Olga Erina, Tomaž
> Erjavec, Soudabeh Eslami, Farah Essaidi, Aline Etienne, Wograine Evelyn,
> Sidney Facundes, Richárd Farkas, Federica Favero, Jannatul Ferdaousi,
> Marília Fernanda, Hector Fernandez Alcalde, Amal Fethi, Jennifer Foster,
> Theodorus Fransen, Cláudia Freitas, Kazunori Fujita, Katarína Gajdošová,
> Daniel Galbraith, Edith Galy, Federica Gamba, Marcos Garcia, Moa
> Gärdenfors, Tanja Gaustad, Efe Eren Genç, Fabrício Ferraz Gerardi, Kim
> Gerdes, Luke Gessler, Filip Ginter, Gustavo Godoy, Iakes Goenaga, Koldo
> Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta
> González Saavedra, Bernadeta Griciūtė, Matias Grioni, Loïc Grobol,
> Normunds Grūzītis, Bruno Guillaume, Kirian Guiller, Céline
> Guillot-Barbance, Tunga Güngör, Nizar Habash, Hinrik Hafsteinsson, Jan
> Hajič, Jan Hajič jr., Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Muhammad
> Yudistira Hanifmuti, Takahiro Harada, Sam Hardwick, Kim Harris, Naïma
> Hassert, Dag Haug, Johannes Heinecke, Oliver Hellwig, Felix Hennig,
> Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Diana Hoefels,
> Petter Hohle, Yidi Huang, Marivel Huerta Mendez, Jena Hwang, Takumi
> Ikeda, Inessa
> Iliadou, Anton Karl Ingason, Radu Ion, Elena Irimia,
> Ọlájídé Ishola, Artan Islamaj, Kaoru Ito, Federica Iurescia, Sandra
> Jagodzińska, Siratun Jannat, Tomáš Jelínek, Apoorva Jha, Katharine
> Jiang, Mayank Jobanputra, Anders Johannsen, Hildur Jónsdóttir, Fredrik
> Jørgensen, Markus Juutinen, Hüner Kaşıkara, Nadezhda Kabaeva, Sylvain
> Kahane, Hiroshi Kanayama, Jenna Kanerva, Neslihan Kara, Ritván Karahóǧa,
> Andre Kåsen, Tolga Kayadelen, Sarveswaran Kengatharaiyer, Václava
> Kettnerová, Lilit Kharatyan, Jesse Kirchner, Elena Klementieva, Elena
> Klyachko, Petr Kocharov, Arne Köhn, Abdullatif Köksal, Kamil Kopacewicz,
> Timo Korkiakangas, Mehmet Köse, Alexey Koshevoy, Natalia Kotsyba,
> Barbara Kovačić, Jolanta Kovalevskaitė, Simon Krek, Parameswari
> Krishnamurthy, Sandra Kübler, Adrian Kuqi, Oğuzhan Kuyrukçu, Aslı
> Kuzgun, Sookyoung Kwak, Kris Kyle, Käbi Laan, Veronika Laippala, Lorenzo
> Lambertino, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev,
> John Lee, Phương Lê Hồng, Alessandro Lenci, Saran Lertpradit, Herman
> Leung, Maria Levina, Lauren Levine, Cheuk Ying Li, Josie Li, Keying Li,
> Yixuan Li, Yuan Li, KyungTae Lim, Bruna Lima Padovani, Yi-Ju Jessica
> Lin, Krister Lindén, Yang Janet Liu, Nikola Ljubešić, Irina Lobzhanidze,
> Olga Loginova, Lucelene Lopes, Stefano Lusito, Anne-Marie Lutgen, Andry
> Luthfi, Mikko Luukko, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz,
> Menel Mahamdi, Jean Maillard, Ilya Makarchuk, Aibek Makazhanov,
> Francesco Mambrini, Michael Mandl, Christopher Manning, Ruli Manurung,
> Büşra Marşan, Cătălina Mărănduc, David Mareček, Katrin Marheinecke,
> Stella Markantonatou, Héctor Martínez Alonso, Lorena Martín Rodríguez,
> André Martins, Cláudia Martins, Jan Mašek, Hiroshi Matsuda, Yuji
> Matsumoto, Alessandro Mazzei, Ryan McDonald, Sarah McGuinness, Maitrey
> Mehta, Pierre André Ménard, Gustavo Mendonça, Tatiana Merzhevich, Paul
> Meurer, Niko Miekka, Emilia Milano, Aaron Miller, Karina Mischenkova,
> Anna Missilä,
> Cătălin Mititelu, Maria Mitrofan, Yusuke Miyao,
> AmirHossein Mojiri Foroushani, Judit Molnár, Amirsaeid Moloodi,
> Simonetta Montemagni, Amir More, Laura Moreno Romero, Giovanni Moretti,
> Shinsuke Mori, Tomohiko Morioka, Shigeki Moro, Bjartur Mortensen, Bohdan
> Moskalevskyi, Kadri Muischnek, Robert Munro, Yugo Murawaki, Kaili
> Müürisep, Pinkey Nainwani, Mariam Nakhlé, Juan Ignacio Navarro
> Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Manuela Nevaci,
> Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Yoshihiro Nikaido, Vitaly
> Nikolaev, Rattima Nitisaroj, Victor Norrman, Alireza Nourian, Maria das
> Graças Volpe Nunes, Hanna Nurmi, Stina Ojala, Atul Kr. Ojha, Hulda
> Óladóttir, Adédayọ̀ Olúòkun, Mai Omura, Emeka Onwuegbuzia, Noam Ordan,
> Petya Osenova, Robert Östling, Annika Ott, Lilja Øvrelid, Şaziye Betül
> Özateş, Merve Özçelik, Arzucan Özgür, Balkız Öztürk Başaran, Teresa
> Paccosi, Alessio Palmero Aprosio, Anastasia Panova, Thiago Alexandre
> Salgueiro Pardo, Hyunji Hayley Park, Niko Partanen, Elena Pascual, Marco
> Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Giulia
> Pedonese, Angelika Peljak-Łapińska, Siyao Peng, Siyao Logan Peng, Rita
> Pereira, Sílvia Pereira, Cenel-Augusto Perez, Natalia Perkova, Guy
> Perrier, Slav Petrov, Daria Petrova, Andrea Peverelli, Jason Phelan,
> Claudel Pierre-Louis, Jussi Piitulainen, Yuval Pinter, Clara Pinto,
> Rodrigo Pintucci, Tommi A Pirinen, Emily Pitler, Magdalena Plamada,
> Barbara Plank, Alistair Plum, Thierry Poibeau, Larisa Ponomareva, Martin
> Popel, Lauma Pretkalniņa, Rigardt Pretorius, Sophie Prévost, Prokopis
> Prokopidis, Adam Przepiórkowski, Robert Pugh, Tiina Puolakainen,
> Christoph Purschke, Sampo Pyysalo, Peng Qi, Andreia Querido, Andriela
> Rääbis, Alexandre Rademaker, Mizanur Rahoman, Taraka Rama, Loganathan
> Ramasamy, Carlos Ramisch, Joana Ramos, Fam Rashel, Mohammad Sadegh
> Rasooli, Vinit Ravishankar, Livy Real, Petru Rebeja, Siva Reddy,
> Mathilde Regnault, Georg Rehm, Arij Riabi,
> Ivan Riabov, Michael Rießler,
> Erika Rimkutė, Larissa Rinaldi, Laura Rituma, Putri Rizqiyah, Luisa
> Rocha, Eiríkur Rögnvaldsson, Ivan Roksandic, Mykhailo Romanenko, Rudolf
> Rosa, Valentin Roșca, Davide Rovati, Ben Rozonoyer, Olga Rudina, Jack
> Rueter, Paolo Ruffolo, Kristján Rúnarsson, Shoval Sadde, Pegah Safari,
> Aleksi Sahala, Shadi Saleh, Alessio Salomoni, Tanja Samardžić, Stephanie
> Samson, Xulia Sánchez-Rodríguez, Manuela Sanguinetti, Ezgi Sanıyar, Dage
> Särg, Marta Sartor, Albina Sarymsakova, Mitsuya Sasaki, Baiba Saulīte,
> Agata Savary, Yanin Sawanakunanon, Shefali Saxena, Kevin Scannell,
> Salvatore Scarlata, Emmanuel Schang, Nathan Schneider, Sebastian
> Schuster, Lane Schwartz, Djamé Seddah, Wolfgang Seeker, Sven Sellmer,
> Mojgan Seraji, Syeda Shahzadi, Mo Shen, Atsuko Shimada, Hiroyuki
> Shirasu, Yana Shishkina, Muh Shohibussirri, Maria Shvedova, Janine
> Siewert, Einar Freyr Sigurðsson, João Silva, Aline Silveira, Natalia
> Silveira, Sara Silveira, Maria Simi, Radu Simionescu, Katalin Simkó,
> Mária Šimková, Haukur Barri Símonarson, Kiril Simov, Dmitri Sitchinava,
> Ted Sither, Aaron Smith, Isabela Soares-Bastos, Per Erik Solberg,
> Barbara Sonnenhauser, Shafi Sourov, Rachele Sprugnoli, Vivian Stamou,
> Steinþór Steingrímsson, Antonio Stella, Abishek Stephen, Milan Straka,
> Emmett Strickland, Jana Strnadová, Alane Suhr, Yogi Lesmana Sulestio,
> Umut Sulubacak, Shingo Suzuki, Daniel Swanson, Zsolt Szántó, Chihiro
> Taguchi, Dima Taji, Fabio Tamburini, Mary Ann C.
> Tan, Takaaki Tanaka,
> Dipta Tanaya, Mirko Tavoni, Samson Tella, Isabelle Tellier, Marinella
> Testori, Guillaume Thomas, Tarık Emre Tıraş, Sara Tonelli, Liisi Torga,
> Marsida Toska, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Utku Türk,
> Francis Tyers, Sveinbjörn Þórðarson, Vilhjálmur Þorsteinsson, Sumire
> Uematsu, Roman Untilov, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit,
> Andrius Utka, Elena Vagnoni, Sowmya Vajjala, Socrates Vak, Rob van der
> Goot, Martine Vanhove, Daniel van Niekerk, Gertjan van Noord, Viktor
> Varga, Uliana Vedenina, Giulia Venturi, Eric Villemonte de la Clergerie,
> Veronika Vincze, Anishka Vissamsetty, Natalia Vlasova, Eleni
> Vligouridou, Aya Wakasa, Joel C. Wallenberg, Lars Wallin, Abigail Walsh,
> John Wang, Jonathan North Washington, Maximilan Wendt, Paul Widmer,
> Shira Wigderson, Sri Hartati Wijono, Vanessa Berwanger Wille, Seyi
> Williams, Mats Wirén, Christian Wittern, Tsegay Woldemariam, Tak-sum
> Wong, Alina Wróblewska, Qishen Wu, Mary Yako, Kayo Yamashita, Naoki
> Yamazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Arife Betül
> Yenice, Enes Yılandiloğlu, Olcay Taner Yıldız, Zhuoran Yu, Arlisa
> Yuliawati, Zdeněk Žabokrtský, Shorouq Zahra, Amir Zeldes, He Zhou,
> Hanzhi Zhu, Yilun Zhu, Anna Zhuravleva, Rayan Ziane
>
>
> References
>
> Marie-Catherine de Marneffe, Christopher Manning, Joakim Nivre, Daniel
> Zeman. 2021. Universal Dependencies. In Computational Linguistics 47:2,
> pp. 255–308.
>
> Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič,
> Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis
> Tyers, Daniel Zeman. 2020. Universal Dependencies v2: An Evergrowing
> Multilingual Treebank Collection. In Proceedings of LREC.
>
> --------------------------------------------------------------------------------
>
> Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D.
> Manning. 2006. Generating typed dependency parses from phrase structure
> parses.
> In Proceedings of LREC.
>
> Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The
> Stanford typed dependencies representation. In COLING Workshop on
> Cross-framework and Cross-domain Parser Evaluation.
>
> Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri
> Haverinen, Filip Ginter, Joakim Nivre, and Christopher Manning. 2014.
> Universal Stanford Dependencies: A cross-linguistic typology. In
> Proceedings of LREC.
>
> Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg,
> Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo
> Pyysalo, Natalia Silveira, Reut Tsarfaty, Daniel Zeman. 2016. Universal
> Dependencies v1: A Multilingual Treebank Collection. In Proceedings of LREC.
>
> Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal
> part-of-speech tagset. In Proceedings of LREC.
>
> Daniel Zeman. 2008. Reusable Tagset Conversion Using Tagset Drivers. In
> Proceedings of LREC.
> _______________________________________________
> Corpora mailing list -- [email protected]
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to [email protected]
