Dear all,
The CLASSLA Knowledge centre for South Slavic languages <https://www.clarin.si/info/k-centre/> is delighted to announce the release of comparable web corpora for all official South Slavic languages, namely Slovenian <https://www.clarin.si/ske/#concordance?corpname=classlaweb_sl> (1.8 billion words), Croatian <https://www.clarin.si/ske/#concordance?corpname=classlaweb_hr> (2.2 billion words), Bosnian <https://www.clarin.si/ske/#concordance?corpname=classlaweb_bs> (802 million words), Montenegrin <https://www.clarin.si/ske/#concordance?corpname=classlaweb_cnr> (151 million words), Serbian <https://www.clarin.si/ske/#concordance?corpname=classlaweb_sr> (2.3 billion words), Macedonian <https://www.clarin.si/ske/#concordance?corpname=classlaweb_mk> (479 million words) and Bulgarian <https://www.clarin.si/ske/#concordance?corpname=classlaweb_bg> (3.3 billion words). Together, these corpora add up to almost 11 billion words!

The linguistic annotation was performed with the state-of-the-art CLASSLA-Stanza <https://pypi.org/project/classla/> toolkit, which you can now also try out through the CLASSLA annotator web interface <https://clarin.si/oznacevalnik/eng>!

Additionally, each of the 26 million documents in these seven corpora is annotated with the X-GENRE multilingual genre classifier <https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier>, enabling the creation of subcorpora based on genre information. Interestingly, although all the corpora were built with the same pipeline, the genre distribution differs noticeably from corpus to corpus.

If you are interested in more details on the sizes, genre distributions and additional insights, we warmly invite you to read our blog post <https://www.clarin.si/info/k-centre/comparable-classla-web-corpora-of-south-slavic-languages/>.
Best regards,
Nikola Ljubešić, Taja Kuzman and many other CLASSLAers
_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
