LLMs with Limited Resources for Slavic Languages @ WMT2025 @ EMNLP2025

Website: https://www2.statmt.org/wmt25/limited-resources-slavic-llm.html

Join our Google Group! https://groups.google.com/g/slavic-llms-mt2025

HuggingFace Collection:
https://huggingface.co/collections/tum-nlp/llms-for-slavic-languages-67f3993bf057be6a8d6665ab

This shared task explores how LLMs perform on machine translation (MT) and
question answering (QA) jointly, aiming to investigate task synergy under
limited data and compute resources. It covers three Slavic languages:
Ukrainian (uk), a mid-resource language (~40M L1 speakers), and Upper
Sorbian (hsb) and Lower Sorbian (dsb), minority West Slavic languages
(30k and 7k L1 speakers, respectively) spoken in Germany.

Data Overview

Ukrainian

   - MT directions: en→uk, cs→uk
   - QA: derived from high-school graduation exams (ZNO)
   - Example training sets:
      - MT: WMT24++ <https://huggingface.co/datasets/google/wmt24pp>, SMOL
        <https://huggingface.co/datasets/google/smol>
      - QA: UNLP2024 <https://huggingface.co/datasets/osyvokon/zno>, ZNO-EVAL
        <https://github.com/NLPForUA/ZNO>, Cohere INCLUDE
        <https://huggingface.co/datasets/CohereForAI/include-base-44>

Upper Sorbian & Lower Sorbian (two separate tracks)

   - MT directions: de→hsb, de→dsb
   - QA: multiple-choice questions drawn from actual CEFR language
     certification exams (levels A1–C1)
   - We will prepare the following resources:
      - Parallel & monolingual corpora via Witaj-Sprachzentrum and the
        Leipzig Corpora Collection;
      - Data from previous WMT low-resource tracks (2020–2022);
      - A QA task adapted from language certifications of different levels.

Submission Guidelines

   - Models must produce both MT & QA outputs for the chosen language(s);
   - Submissions are language-specific; submit to one or multiple language
     tracks;
   - Participants may use only one of the following base models, which are
     limited to a maximum of 3B parameters:
      - Qwen2.5-3B-Instruct <https://huggingface.co/Qwen/Qwen2.5-3B-Instruct>
      - Qwen2.5-1.5B <https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct>
      - Qwen2.5-0.5B <https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct>
      - Quantized or Unsloth variants from HuggingFace collections
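
For participants who want a quick sanity check against the model restriction
above, here is a minimal sketch in Python. It is only a heuristic and not an
official validation tool: the function name is_allowed_base is hypothetical,
the allowed names are taken from the repository links in the list above, and
the rule "the model name starts with an allowed name" is our assumption about
how quantized or Unsloth variants are typically named on HuggingFace. The
organisers' actual acceptance criteria may differ.

```python
# Model names taken from the repository links in the list above.
ALLOWED_BASE_MODELS = (
    "Qwen2.5-3B-Instruct",
    "Qwen2.5-1.5B-Instruct",
    "Qwen2.5-0.5B-Instruct",
)


def is_allowed_base(repo_id: str) -> bool:
    """Heuristic check (assumption, not an official rule).

    Drops the organisation prefix ("Qwen/", "unsloth/", ...) and accepts the
    model if its name equals an allowed name or extends it with a "-suffix",
    as quantized variants commonly do.
    """
    name = repo_id.split("/")[-1]
    return any(
        name == base or name.startswith(base + "-")
        for base in ALLOWED_BASE_MODELS
    )


if __name__ == "__main__":
    # The second and third IDs are illustrative examples only.
    for candidate in (
        "Qwen/Qwen2.5-3B-Instruct",
        "unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit",
        "Qwen/Qwen2.5-7B-Instruct",
    ):
        print(candidate, is_allowed_base(candidate))
```

A name-based check like this cannot confirm the parameter count or the
provenance of a checkpoint, so it is best treated as a first filter before
asking on the Google Group.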

Key Dates (AoE)

   - Registration: open now! Join our Google Group:
     https://groups.google.com/g/slavic-llms-mt2025
   - Training/dev data release: Late April
   - Test data release: Late June
   - Submission deadline: Early July
   - System description deadline: Late July
   - Final workshop: 5-9 November @ EMNLP 2025 in Suzhou, China!

Organisers

TUM Heilbronn:

Daryna Dementieva
Marion di Marco
Lukas Edman
Alexander Fraser
Kathy Hämmerl
Shu Okabe

Witaj-Sprachzentrum:

Beate Brězan
Anita Hendrichowa
Marko Měškank
Tomaš Šołta

Acknowledgements
We thank the UNLP 2024 Shared Task team (Roman Kyslyi, Mariana Romanyshyn,
Oleksiy Syvokon) for kindly sharing Ukrainian QA resources.

Best regards,
Daryna Dementieva
On behalf of TUM Heilbronn Workshop Organizers