**CALL FOR PARTICIPATION** Two peas in a pod: PARSEME 2.0 and AdMIRe 2.0 multilingual UniDive shared tasks on idiomaticity and multiword expressions
====================================================================

The UniDive COST Action <https://unidive.lisn.upsaclay.fr/> is happy to announce new editions of the AdMIRe and PARSEME shared tasks, dedicated to idiomaticity and multiword expressions (MWEs). MWEs are groups of words with non-compositional semantics, i.e. their meaning cannot be straightforwardly deduced from the meanings of their components. For instance, a bad apple is a person who has a bad influence on others. Examples of MWEs include idioms: nominal (This child is a bad apple), adjectival (Most characters in his movies are somewhat larger than life), adverbial (It happens from time to time), functional (I’ll do it on behalf of you), verbal (It's raining cats and dogs today!), as well as light verb constructions (I will pay a visit to my aunt), inherently reflexive verbs (Help yourself to some cake), etc.

Both shared tasks will take place together during fall and winter of 2025/2026, although they use different resources and modalities. We hope to co-organise the culminating workshop with the SIGLEX-MWE section <https://multiword.org/> and to co-locate it with EACL 2026 in Morocco (24-28 March 2026), but this is still to be confirmed.

PARSEME 2.0 is a shared task whose main objective is to identify and paraphrase multiword expressions (MWEs) in written text. The three previous editions of the PARSEME shared task (1.0 <https://aclanthology.org/W17-1704.pdf>, 1.1 <https://aclanthology.org/W18-4925.pdf>, 1.2 <https://aclanthology.org/2020.mwe-1.14.pdf>) were dedicated to the identification of MWEs, but focused on verbal MWEs only. This edition covers all syntactic types of MWEs. Additionally, it addresses paraphrasing of MWEs, which is closer to understanding than sheer identification. We propose two subtasks: the first corresponds to the classical identification task in running text; the second consists in paraphrasing a sentence containing a MWE so as to remove idiomaticity. Data annotation is ongoing and at least 19 languages are expected to be covered: Albanian, Brazilian Portuguese, Dutch, Egyptian (ca. 2700-2000 BC), French, Georgian, Greek (Ancient), Greek (Modern), Hebrew, Italian, Japanese, Latvian, Lithuanian, Persian, Polish, Romanian, Serbian, Swedish and Ukrainian.

AdMIRe 2.0 (Advancing Multimodal Idiomaticity Representation) is a shared task that addresses the challenge of idiomatic language understanding by evaluating how well computational models interpret potentially idiomatic expressions (PIEs) using both text and images. This new edition extends the first edition of the AdMIRe task <https://arxiv.org/pdf/2503.15358> by broadening language coverage to around 30 languages from the UniDive network and by introducing parallel idiomaticity datasets designed to assess the cross-lingual and multilingual capabilities of current language technologies. By evaluating how well images reflect idiomatic versus literal meanings, AdMIRe 2.0 establishes a richer benchmark for multimodal, multilingual idiomatic language comprehension. Given a context sentence containing a PIE and a set of five images, the task is to rank the images based on how accurately they depict the meaning of the PIE used in that sentence. The task will be zero-shot for newly introduced languages: only the English dataset will be provided to teams wishing to fine-tune their models or apply few-shot learning techniques.
This setup allows participants to test the generalization and cross-lingual capabilities of their systems while minimizing data preparation efforts. While the task is designed to encourage participation from teams working on multimodal technologies, it also accommodates approaches focused solely on text. To support broader participation and reduce the complexity and computational cost for such teams, the organizers will provide automatically generated descriptive captions for each image, allowing models to rely exclusively on text input if desired.

Subtasks
--------

PARSEME 2.0:

Subtask 1: MWE identification
The main goal of this subtask is to evaluate the systems' ability to identify MWEs. Systems must recognize the tokens that belong to MWEs in running text. Participants are strongly encouraged to pay attention to the diversity of their predictions: in addition to performance scores (F-score), the results will be evaluated in terms of the diversity of the MWEs correctly predicted by the systems.

Subtask 2: Paraphrasing MWEs
Given a sentence containing a MWE, the task is to generate a new sentence with the same meaning but without this MWE. As a continuation of Subtask 1, this task tests not only the systems' ability to identify MWEs, but also to capture their meaning. Here again, we encourage systems to produce diverse paraphrases, and diversity measures will be used in the evaluation.

AdMIRe 2.0: Static Image Ranking

Participants are presented with a context sentence containing a PIE and a set of five images. The objective is to rank these images based on how well they visually represent the idiomatic meaning within that specific context. For each idiom, five different images are generated in a consistent style. These images cover a range of idiomaticity: a synonym for the idiomatic meaning, a synonym for the literal meaning, something related to the idiomatic meaning but not synonymous, something related to the literal meaning but not synonymous, and a distractor image that is unrelated to both meanings.

Which of these images best represents the meaning of the phrase bad apple in the following sentence? "We have to recognize that this is not the occasional bad apple but a structural, sector-wide problem." And how about here? "However, if ethylene happens to be around (say from a bad apple), these fruits do ripen more quickly."

Provided data
-------------

PARSEME 2.0:

The PARSEME community <https://gitlab.com/parseme/corpora/-/wikis> has been involved in a long-standing effort of universalist modelling and annotation of MWEs across many languages. Corpora for 30 languages annotated for verbal MWEs have been published in the past. For Subtask 1 (identification), we are now in the process of annotating other syntactic types of MWEs (nominal, adjectival, etc.). The annotation follows the guidelines <https://parsemefr.lis-lab.fr/parseme-st-guidelines/2.0/>, unified across 30 languages. The training data will contain several hundred to several thousand annotated MWEs, depending on the language. The cupt format <https://gitlab.com/parseme/corpora/-/wikis/cupt-format>, an extension of the .conllu format, will be used: every sentence will thus carry MWE annotations on top of its morphosyntactic annotation.
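To make the format concrete, the sketch below shows how the extra PARSEME:MWE column of a .cupt sentence can be read: the first token of each MWE carries an annotation of the form id:CATEGORY, the remaining tokens of that MWE carry the bare id, and tokens outside any MWE carry an asterisk. The example sentence (the light verb construction mentioned above), its simplified morphosyntactic annotation and the category tag are illustrative only and are not taken from the shared task data.

    # A minimal sketch of reading the PARSEME:MWE column of the .cupt format.
    # The sentence, its (simplified) annotation and the "LVC.full" category tag
    # are illustrative; the official tools also handle details skipped here
    # (multiword-token ranges, empty nodes, whole files).
    from collections import defaultdict

    cupt_lines = [
        "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE",
        "# text = I will pay a visit to my aunt.",
        "1\tI\tI\tPRON\t_\t_\t3\tnsubj\t_\t_\t*",
        "2\twill\twill\tAUX\t_\t_\t3\taux\t_\t_\t*",
        "3\tpay\tpay\tVERB\t_\t_\t0\troot\t_\t_\t1:LVC.full",
        "4\ta\ta\tDET\t_\t_\t5\tdet\t_\t_\t*",
        "5\tvisit\tvisit\tNOUN\t_\t_\t3\tobj\t_\t_\t1",
        "6\tto\tto\tADP\t_\t_\t8\tcase\t_\t_\t*",
        "7\tmy\tmy\tPRON\t_\t_\t8\tnmod:poss\t_\t_\t*",
        "8\taunt\taunt\tNOUN\t_\t_\t3\tobl\t_\t_\t*",
        "9\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\t*",
    ]

    mwes = defaultdict(list)  # MWE id -> list of (token id, form) pairs
    for line in cupt_lines:
        if not line or line.startswith("#"):
            continue                      # skip comments and blank lines
        cols = line.split("\t")
        tok_id, form, mwe_col = cols[0], cols[1], cols[10]
        if mwe_col == "*":
            continue                      # token belongs to no MWE
        for code in mwe_col.split(";"):   # a token may belong to several MWEs
            mwe_id = code.split(":")[0]   # "1:LVC.full" -> "1", "1" -> "1"
            mwes[mwe_id].append((tok_id, form))

    print(dict(mwes))  # {'1': [('3', 'pay'), ('5', 'visit')]}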
For Subtask 2 (paraphrasing), we provide a small set of trial data but no training data. All test data will be handcrafted based on subsets of the test data for Subtask 1. Each test sentence will contain a single MWE, which will necessarily be a nominal, adjectival or verbal idiom.

AdMIRe 2.0:

The AdMIRe 2.0 dataset, developed for Advancing Multimodal Idiomaticity Representation, supports the study of potentially idiomatic expressions (PIEs) across textual and visual modalities in around 30 languages beyond English. The test data is meticulously created by language experts, with both textual and visual components crafted under strict guidelines. Image generation is carried out using Discord-based tools, primarily Midjourney and the free alternative AdMIRe Bot (Flux-Schnell), to produce consistent and diverse visual interpretations. Each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including a distractor, and by at least two context sentences: one in which the expression is used literally and one in which it is used idiomatically. These sentences were obtained from existing corpora or were written specifically for AdMIRe. To ensure the integrity and usability of the dataset, all data undergo rigorous human review and ethical clearance, resulting in a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding. The task will be conducted in a zero-shot setting for all languages except English: only the English dataset will be released for teams wishing to fine-tune their models or apply few-shot learning techniques. This setup enables participants to evaluate the generalization and cross-lingual transfer capabilities of their systems while minimizing data preparation effort.

Evaluation metrics
------------------

PARSEME 2.0 employs specific metrics for its two subtasks, measuring both performance and diversity:

- Subtask 1: MWE identification
  - Performance: Precision, Recall and F-measure in two variants, both as macro-averages and as per-language scores
    - MWE-based (all tokens of a MWE have to be perfectly identified)
    - Token-based (partial identification is also rewarded)
  - Diversity: diversity will be measured along two dimensions, variety (how many MWE types a system is able to identify) and balance (how evenly a system pays attention to the various MWE types), as well as entropy (a hybrid measure combining variety and balance).
- Subtask 2: Paraphrasing MWEs
  - Performance:
    - BERT-score will be used. This measure compares two sentences and gives a score for their similarity. The human experts provide two paraphrases per sentence: a "minimal" one (as similar as possible to the source sentence) and a "creative" one (as different as possible from the source). The final score will be the maximum BERT-score between the prediction and either the minimal or the creative gold standard (see the sketch after this list).
    - Manual evaluation: if enough human resources are available, a manual evaluation of the best-performing systems will be carried out in order to refine the results.
  - Diversity: for each language, all paraphrases will be evaluated jointly for variety, balance and entropy. These scores will be integrated into the leaderboard, to reward systems producing the most diverse outputs. These results will also enable a more in-depth analysis of the differences in diversity between humans and generative systems.
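As a concrete reading of the Subtask 2 performance score, the sketch below uses the bert-score Python package to score one system paraphrase against a minimal and a creative gold paraphrase and keeps the higher of the two BERT-scores. The sentences, variable names and model choice (lang="en") are invented for illustration; the official evaluation script may differ in these details.

    # A minimal sketch of the Subtask 2 performance score described above:
    # BERT-score of the prediction against both gold paraphrases, keeping the max.
    # All sentences below are invented examples, not shared task data.
    from bert_score import score

    predictions   = ["This child has a bad influence on the other kids."]
    minimal_refs  = ["This child is a person with a bad influence on others."]
    creative_refs = ["The other kids pick up bad habits from this one."]

    # F1 BERT-scores against each gold paraphrase (lang selects the underlying model).
    _, _, f_minimal  = score(predictions, minimal_refs,  lang="en")
    _, _, f_creative = score(predictions, creative_refs, lang="en")

    # Per-sentence final score = maximum of the two BERT-scores.
    final = [max(m, c) for m, c in zip(f_minimal.tolist(), f_creative.tolist())]
    print(final)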
AdMIRe 2.0 also employs specific evaluation for the Static Image Ranking task. Competition rankings are based on top image accuracy, with DCG used to break ties. Final rankings in AdMIRe 2.0 will be based on the average top image accuracy across all languages, with each language contributing equally to the final score.

- Top Image Accuracy: measures whether the most representative image is correctly identified.
- Normalized Discounted Cumulative Gain (NDCG): assesses the ranking quality, using a new weighting of [3, 1, 0, 0, 0] for the five image positions to avoid penalizing systems for permuting the order of low-relevance images.

Important dates
---------------

- [5 SEPTEMBER] Publication of trial data and baselines
- [1 OCTOBER] Training data released
- [15 DECEMBER] Publication of blind test data
- [19 DECEMBER] Submission of system predictions
- [10 JANUARY] Systems evaluated
- [JANUARY] Submission deadline for system description papers
- [24-29 MARCH 2026: EACL] MWE workshop (to be confirmed)

Organizing team
---------------

PARSEME 2.0:

- Manon Scholivet, Université Paris Saclay, LISN, FR
- Takuya Nakamura, Université Paris Saclay, LISN, FR
- Agata Savary, Université Paris Saclay, LISN, FR
- Éric Bilinski, Université Paris Saclay, LISN, FR
- Carlos Ramisch, Aix-Marseille Université, LIS, FR

AdMIRe 2.0:

- Adriana Pagano <https://scholar.google.com/citations?user=iMOX_EQAAAAJ&hl=en&oi=ao>, Universidade Federal de Minas Gerais, BR
- Aline Villavicencio <https://sites.google.com/view/alinev>, University of Exeter, UK
- Dilara Torunoğlu Selamet <https://scholar.google.com/citations?user=mkpbvoAAAAAJ&hl=en>, Istanbul Technical University, TR
- Doğukan Arslan <https://scholar.google.com/citations?user=8Lc2J1cAAAAJ&hl=en&oi=ao>, Istanbul Technical University, TR
- Gülşen Eryiğit <https://scholar.google.com/citations?user=25CpSdkAAAAJ&hl=en&oi=ao>, Istanbul Technical University, TR
- Rodrigo Wilkens <https://scholar.google.com/citations?user=-sIkqlEAAAAJ&hl=en>, University of Exeter, UK
- Wei He <https://scholar.google.com.hk/citations?user=3BWaQ4cAAAAJ&hl=zh-CN>, University of Exeter, UK