[Corpora-List] [2nd call] SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages

Oksana Dereza via Corpora Wed, 03 Jan 2024 16:27:50 -0800

Dear colleagues,

[apologies for cross-posting]

We would like to remind you that this year SIGTYP is hosting a Shared Task
on Word Embedding Evaluation for Ancient and Historical Languages:
https://github.com/sigtyp/ST2024/

Test data has been released, and CodaLab competitions are up and running,
so we encourage you to register if you have not done so yet! There is still a week
before the deadline. :)

*Summary*
In recent years, sets of downstream tasks called benchmarks have become a
very popular, if not default, method to evaluate general-purpose word and
sentence embeddings. Starting with decaNLP (McCann et al., 2018) and
SentEval (Conneau & Kiela, 2018), multitask benchmarks for NLU keep
appearing and improving every year. However, even the largest multilingual
benchmarks, such as XGLUE, XTREME, XTREME-R or XTREME-UP (Hu et al., 2020;
Liang et al., 2020; Ruder et al., 2021, 2023), only include modern
languages. When it comes to ancient and historical languages, scholars
mostly adapt/translate intrinsic evaluation datasets from modern languages
or create their own diagnostic tests. We argue that there is a need for a
universal evaluation benchmark for embeddings learned from ancient and
historical language data and view this shared task as a proving ground for
it.

The shared task involves solving the problems listed under Subtasks below
for 12+ ancient and historical languages that belong to 4 language families
and use 6 different scripts. Participants will be invited to describe their
system in a paper for the SIGTYP workshop proceedings. The task organizers
will write an overview paper that describes the task, summarizes the
different approaches taken, and analyzes their results.

*Subtasks*
For subtask A, participants are not allowed to use any additional data;
however, they can reduce and balance the provided training datasets if they see
fit. For subtask B, participants are allowed to use any additional data in
any language, including pre-trained embeddings and LLMs.

A. Constrained

1. POS-tagging
2. Full morphological annotation
3. Lemmatisation

B. Unconstrained

1. POS-tagging
2. Detailed morphological annotation
3. Lemmatisation
4. Filling the gaps
   - Word-level
   - Character-level

*Important links*

- *Registration form*:
https://docs.google.com/forms/d/e/1FAIpQLSdINgMfzzZGIZ-uBVQhvyndB6yeaaj-wT7v45A6UB4F2h6QBQ/viewform?usp=sf_link
- Detailed description, incl. submission format:
https://github.com/sigtyp/ST2024
- Constrained subtask on CodaLab:
https://codalab.lisn.upsaclay.fr/competitions/16822
- Unconstrained subtask on CodaLab:
https://codalab.lisn.upsaclay.fr/competitions/16818

*Important dates*

- *05 Nov 2023*: Release of training and validation data
- *02 Jan 2024*: Release of test data
- *09 Jan 2024*: Submission of results for Phase 1 of the Constrained
  Subtask
- *12 Jan 2024*: Submission of results for Phase 2 of the Constrained
  Subtask and for the Unconstrained Subtask
- *13 Jan 2024*: Notification of results
- *20 Jan 2024*: Submission of shared task papers
- *27 Jan 2024*: Notification of acceptance to authors
- *03 Feb 2024*: Camera-ready
- *15 Mar 2024*: Video recordings due
- *21/22 Mar 2024*: SIGTYP workshop

Kind regards,

Oksana and the organisers' team

