Dear all,

If you are involved in (web) corpora creation and curation, interested in large multilingual corpora for European languages, or working with automatic genre annotation, the following resources might be useful for you. Multiple multilingual genre-related resources and technologies are now available on the CLARIN.SI and Hugging Face repositories:

- Genre-enriched MaCoCu-Genre web corpora - MaCoCu web corpora for 13 European languages (Albanian, Bosnian, Bulgarian, Catalan, Croatian, Greek, Icelandic, Macedonian, Montenegrin, Serbian, Slovenian, Turkish, and Ukrainian), automatically annotated with genre labels. In total, the corpus collection comprises 67 million texts and 28.5 billion words. They are available on the CLARIN.SI repository: http://hdl.handle.net/11356/1969

- ๐—ซ-๐—š๐—˜๐—ก๐—ฅ๐—˜ ๐—ฐ๐—น๐—ฎ๐˜€๐˜€๐—ถ๐—ณ๐—ถ๐—ฒ๐—ฟ - multilingual text genre classifier, applicable to any of the 100 languages that are included in the XLM-RoBERTa model - available on Hugging Face (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier) and CLARIN.SI repository (http://hdl.handle.net/11356/1961)

- ๐—˜๐—ป๐—ด๐—น๐—ถ๐˜€๐—ต-๐—ฆ๐—น๐—ผ๐˜ƒ๐—ฒ๐—ป๐—ถ๐—ฎ๐—ป ๐—ซ-๐—š๐—˜๐—ก๐—ฅ๐—˜ ๐—ฑ๐—ฎ๐˜๐—ฎ๐˜€๐—ฒ๐˜ - manually-annotated genre dataset, used for training and evaluation of the X-GENRE classifier - available on Hugging Face (https://huggingface.co/datasets/TajaKuzman/X-GENRE-text-genre-dataset) and CLARIN.SI repository (http://hdl.handle.net/11356/1960).

Additionally, we set up a benchmark for automatic genre identification (https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark) for continuous evaluation of emerging technologies on this task. The benchmark is based on unpublished manually annotated datasets. If you wish to test your own systems on the task, let me know, and we'll be happy to share them with you.

Best regards,

--


Taja Kuzman

Research Assistant

Department of Knowledge Technologies | Jožef Stefan Institute, Slovenia

CLASSLA Knowledge Centre for South Slavic languages | CLARIN.SI




        
Twitter: <https://twitter.com/TajaKuzman>
LinkedIn: <https://www.linkedin.com/in/taja-kuzman/>

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
