Dear colleagues, [apologies for cross-posting]

We would like to remind you that this year SIGTYP is hosting a Shared Task on Word Embedding Evaluation for Ancient and Historical Languages: https://github.com/sigtyp/ST2024/

Test data has been released, and the CodaLab competitions are up and running, so we encourage you to register if you have not done so yet! There is still a week before the deadline. :)

*Summary*

In recent years, sets of downstream tasks called benchmarks have become a very popular, if not the default, method of evaluating general-purpose word and sentence embeddings. Starting with decaNLP (McCann et al., 2018) and SentEval (Conneau & Kiela, 2018), multitask benchmarks for NLU keep appearing and improving every year. However, even the largest multilingual benchmarks, such as XGLUE, XTREME, XTREME-R or XTREME-UP (Hu et al., 2020; Liang et al., 2020; Ruder et al., 2021, 2023), only include modern languages. When it comes to ancient and historical languages, scholars mostly adapt or translate intrinsic evaluation datasets from modern languages or create their own diagnostic tests. We argue that there is a need for a universal evaluation benchmark for embeddings learned from ancient and historical language data, and we view this shared task as a proving ground for it.

The shared task involves solving the problems listed under Subtasks below for 12+ ancient and historical languages that belong to 4 language families and use 6 different scripts.

Participants will be invited to describe their system in a paper for the SIGTYP workshop proceedings. The task organizers will write an overview paper that describes the task, summarizes the approaches taken, and analyzes their results.

*Subtasks*

For subtask A, participants are not allowed to use any additional data; however, they can reduce and balance the provided training datasets if they see fit. For subtask B, participants are allowed to use any additional data in any language, including pre-trained embeddings and LLMs.

A. Constrained
1. POS-tagging
2. Full morphological annotation
3. Lemmatisation

B. Unconstrained
1. POS-tagging
2. Detailed morphological annotation
3. Lemmatisation
4. Filling the gaps
   - Word-level
   - Character-level

*Important links*

- Registration form: https://docs.google.com/forms/d/e/1FAIpQLSdINgMfzzZGIZ-uBVQhvyndB6yeaaj-wT7v45A6UB4F2h6QBQ/viewform?usp=sf_link
- Detailed description, incl. submission format: https://github.com/sigtyp/ST2024
- Constrained subtask on CodaLab: https://codalab.lisn.upsaclay.fr/competitions/16822
- Unconstrained subtask on CodaLab: https://codalab.lisn.upsaclay.fr/competitions/16818

*Important dates*

- *05 Nov 2023*: Release of training and validation data
- *02 Jan 2024*: Release of test data
- *09 Jan 2024*: Submission of results for Phase 1 of the Constrained Subtask
- *12 Jan 2024*: Submission of results for Phase 2 of the Constrained Subtask and for the Unconstrained Subtask
- *13 Jan 2024*: Notification of results
- *20 Jan 2024*: Submission of shared task papers
- *27 Jan 2024*: Notification of acceptance to authors
- *03 Feb 2024*: Camera-ready
- *15 Mar 2024*: Video recordings due
- *21/22 Mar 2024*: SIGTYP workshop

Kind regards,
Oksana and the organisers' team

--
Oksana Dereza | PhD student on the Cardamom <http://cardamom.insight-centre.org/> project | Unit for Linguistic Data | Insight Centre for Data Analytics | Data Science Institute | University of Galway
Oksana Dereza | Iarrthóir PhD ar thionscadal Cardamom <http://cardamom.insight-centre.org/> | An tAonad um Shonraí Teangeolaíocha | Insight, Ionad na hAnailísíochta Sonraí | Institiúid Eolaíochta Sonraí | Ollscoil na Gaillimhe
