Dear colleagues,

[Apologies for cross-posting]

In 2024, SIGTYP is hosting a *Shared Task on Word Embedding Evaluation for
Ancient and Historical Languages* (https://sigtyp.github.io/st2024.html).
The SIGTYP workshop will be co-located with EACL.

*Summary*
In recent years, sets of downstream tasks called benchmarks have become a
very popular, if not default, method to evaluate general-purpose word and
sentence embeddings. Starting with decaNLP (McCann et al., 2018) and
SentEval (Conneau & Kiela, 2018), multitask benchmarks for NLU keep
appearing and improving every year. However, even the largest multilingual
benchmarks, such as XGLUE, XTREME, XTREME-R or XTREME-UP (Hu et al., 2020;
Liang et al., 2020; Ruder et al., 2021, 2023), only include modern
languages. When it comes to ancient and historical languages, scholars
mostly adapt/translate intrinsic evaluation datasets from modern languages
or create their own diagnostic tests. We argue that there is a need for a
universal evaluation benchmark for embeddings learned from ancient and
historical language data and view this shared task as a proving ground for
it.

The shared task covers the problems listed under *Subtasks* below for 12+
ancient and historical languages that belong to 4 language families and use
6 different scripts. Participants will be invited to describe their systems
in a paper for the SIGTYP workshop proceedings. The task organisers will
write an overview paper that describes the task, summarises the approaches
taken, and analyses their results.

*Subtasks*
For subtask A, participants are not allowed to use any additional data;
however, they may reduce and balance the provided training datasets as they
see fit. For subtask B, participants are allowed to use any additional data
in any language, including pre-trained embeddings and LLMs.

A. Constrained

   1. POS-tagging
   2. Full morphological annotation
   3. Lemmatisation

B. Unconstrained

   1. POS-tagging
   2. Full morphological annotation
   3. Lemmatisation
   4. Filling the gaps
      - Word-level
      - Character-level

*Data*
For tasks 1-3, we use Universal Dependencies v. 2.12 data (Zeman et al.,
2023) in 11 ancient and historical languages, complemented by 5 Old
Hungarian codices from the MGTSZ website (HAS Research Institute for
Linguistics, 2018) that are annotated to the same standard as the corpora
available through UD. For task 4, we add historical Irish data from CELT (Ó
Corráin et al., 1997), Corpas Stairiúil na Gaeilge (Acadamh Ríoga na
hÉireann, 2017), and digital editions of the St. Gall glosses (Bauer et
al., 2017) and the Würzburg glosses (Doyle, 2018) as a case study of how
performance may vary on different historical stages of the same language.
We set the upper temporal boundary to 1700 CE and do not include texts
created after that date in our dataset.

List of languages:

   - Ancient Greek
   - Ancient Hebrew
   - Classical Chinese
   - Coptic
   - Gothic
   - Classical, Late & Medieval Latin
   - Medieval Icelandic
   - Old Church Slavonic
   - Old East Slavic
   - Old French
   - Old Hungarian
   - Old, Middle & Early Modern Irish
   - Vedic Sanskrit
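
For participants new to Universal Dependencies data, here is a minimal
sketch of how the annotations relevant to tasks 1-3 could be read. It
assumes the files follow the standard ten-column CoNLL-U layout used by UD;
please check the data repository for the actual format. The function names
and the two-token Latin sample are hypothetical, written for illustration
only, and are not part of the shared task materials.

```python
from typing import Dict, List

# Standard CoNLL-U column order used by Universal Dependencies.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_conllu(text: str) -> List[List[dict]]:
    """Parse CoNLL-U text into a list of sentences of token dicts.

    Comment lines, multiword-token ranges (IDs like '1-2') and empty
    nodes (IDs like '1.1') are skipped.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # ignore malformed lines in this sketch
        token = dict(zip(COLUMNS, fields))
        if "-" in token["id"] or "." in token["id"]:
            continue
        token["feats"] = parse_feats(token["feats"])
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample sentence, not taken from the shared task data.
SAMPLE = (
    "# text = Puella cantat\n"
    "1\tPuella\tpuella\tNOUN\t_\tCase=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tcantat\tcanto\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres"
    "|VerbForm=Fin|Voice=Act\t0\troot\t_\t_\n"
)

for sentence in read_conllu(SAMPLE):
    for tok in sentence:
        # UPOS -> task 1, FEATS -> task 2, LEMMA -> task 3
        print(tok["form"], tok["upos"], tok["feats"], tok["lemma"])
```

In this layout the UPOS column corresponds to POS-tagging, FEATS to full
morphological annotation, and LEMMA to lemmatisation.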

*Important dates*

    *05 Nov 2023*: Release of training and validation data
    *02 Jan 2024*: Release of test data
    *08 Jan 2024*: Submission of the systems
    *13 Jan 2024*: Notification of results
    *20 Jan 2024*: Submission of shared task papers
    *27 Jan 2024*: Notification of acceptance to authors
    *03 Feb 2024*: Camera-ready
    *15 Mar 2024*: Video recordings due
    *21/22 Mar 2024*: SIGTYP workshop

*Important links*

   - *Registration form*:
   <https://docs.google.com/forms/d/e/1FAIpQLSdINgMfzzZGIZ-uBVQhvyndB6yeaaj-wT7v45A6UB4F2h6QBQ/viewform?usp=sf_link>
   - Data + detailed description: https://github.com/sigtyp/ST2024

*Task organisers*

   - Oksana Dereza, Insight SFI Research Centre for Data Analytics, Data
   Science Institute, University of Galway
   - Priya Rani, SFI Centre for Research and Training in AI, Data Science
   Institute, University of Galway
   - Atul Kr. Ojha, Insight SFI Research Centre for Data Analytics, Data
   Science Institute, University of Galway
   - Adrian Doyle, Insight SFI Research Centre for Data Analytics, Data
   Science Institute, University of Galway
   - Pádraic Moran, School of Languages, Literatures and Cultures, Moore
   Institute, University of Galway
   - John P. McCrae, Insight SFI Research Centre for Data Analytics, Data
   Science Institute, University of Galway

*Contact details*

   - Oksana: [email protected]
   - Priya: [email protected]


Best wishes,
Oksana and the organisers

-- 
Oksana Dereza  | PhD student on the Cardamom
<http://cardamom.insight-centre.org/> project | Unit for Linguistic Data |
Insight Centre for Data Analytics | Data Science Institute | University of
Galway
