[Corpora-List] Genre-enriched web corpora and multilingual genre classifier

Taja Kuzman via Corpora Fri, 25 Oct 2024 00:18:28 -0700

Dear all,

If you are involved in (web) corpora creation and curation, interested in large multilingual corpora for European languages, or working with automatic genre annotation, the following resources might be useful for you. Multiple multilingual genre-related resources and technologies are now available on the CLARIN.SI and Hugging Face repositories:

- Genre-enriched MaCoCu-Genre web corpora - MaCoCu web corpora for 13 European languages (Albanian, Bosnian, Bulgarian, Catalan, Croatian, Greek, Icelandic, Macedonian, Montenegrin, Serbian, Slovenian, Turkish, and Ukrainian), automatically annotated with genre labels. In total, the corpus collection comprises 67 million texts and 28.5 billion words. They are available on the CLARIN.SI repository: http://hdl.handle.net/11356/1969

- X-GENRE classifier - multilingual text genre classifier, applicable to any of the 100 languages that are included in the XLM-RoBERTa model - available on Hugging Face (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier) and the CLARIN.SI repository (http://hdl.handle.net/11356/1961)

- English-Slovenian X-GENRE dataset - manually annotated genre dataset, used for training and evaluation of the X-GENRE classifier - available on Hugging Face (https://huggingface.co/datasets/TajaKuzman/X-GENRE-text-genre-dataset) and the CLARIN.SI repository (http://hdl.handle.net/11356/1960).
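For readers who want to try the X-GENRE classifier listed above, here is a minimal sketch of how it might be called through the Hugging Face `transformers` text-classification pipeline. The model ID is taken from the URL above; the helper names `load_genre_classifier` and `genre_distribution` are hypothetical and only for illustration, and the sketch assumes `transformers` and a backend such as PyTorch are installed. It makes no assumption about the label names the model returns.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

# Model ID taken from the Hugging Face URL in the announcement.
MODEL_ID = "classla/xlm-roberta-base-multilingual-text-genre-classifier"

Prediction = Dict[str, object]
Classifier = Callable[[List[str]], List[Prediction]]


def load_genre_classifier() -> Classifier:
    """Build a callable that maps a list of texts to label/score dicts.

    Hypothetical helper: downloads the model on first use.
    """
    # Imported lazily so the counting helper below works without transformers.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=MODEL_ID)
    # truncation=True clips long texts to the model's maximum input length.
    return lambda texts: classifier(texts, truncation=True)


def genre_distribution(texts: Iterable[str], classify: Classifier) -> Counter:
    """Count how often each predicted genre label occurs in `texts`."""
    predictions = classify(list(texts))
    return Counter(str(p["label"]) for p in predictions)
```

Calling `genre_distribution(my_texts, load_genre_classifier())` on a list of strings should then give a per-label count for that sample. Passing the classifier in as an argument keeps the counting logic independent of the model, so it can be checked with a stub before anything is downloaded.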

Additionally, we set up a benchmark for automatic genre identification (https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark) for continuous evaluation of the emerging technologies on this task. The benchmark is based on unpublished manually annotated datasets - if you wish to test your own systems on the task, let me know, and we'll be happy to share them with you.


Best regards,

--
Taja Kuzman

Research Assistant

Department of Knowledge Technologies | Jožef Stefan Institute, Slovenia

CLASSLA Knowledge Centre for South Slavic languages | CLARIN.SI




        
twitter <https://twitter.com/TajaKuzman>
linkedin <https://www.linkedin.com/in/taja-kuzman/>

