**CALL FOR PARTICIPATION** Two peas in a pod: PARSEME 2.0 and AdMIRe 2.0 multilingual UniDive shared tasks on idiomaticity and multiword expressions
====================================================================

The UniDive COST Action <https://unidive.lisn.upsaclay.fr/> is happy to announce new editions of the AdMIRe and PARSEME shared tasks, dedicated to idiomaticity and multiword expressions (MWEs). MWEs are groups of words with non-compositional semantics, i.e. their meaning cannot be straightforwardly deduced from the meanings of their components. For instance, a bad apple is a person who has a bad influence on others. Examples of MWEs include idioms: nominal (This child is a bad apple), adjectival (Most characters in his movies are somewhat larger than life), adverbial (It happens from time to time), functional (I’ll do it on behalf of you), verbal (It's raining cats and dogs today!), as well as light verb constructions (I will pay a visit to my aunt), inherently reflexive verbs (Help yourself to some cake), etc.

Both shared tasks will take place together during fall and winter of 2025/2026, although they use different resources and modalities. We hope to co-organise the culminating workshop with the SIGLEX-MWE section <https://multiword.org/> and to co-locate it with EACL 2026 in Morocco (24-28 March 2026), but this is still to be confirmed.

PARSEME 2.0 is a shared task whose main objective is to identify and paraphrase multiword expressions (MWEs) in written text. The three previous editions of the PARSEME shared task (1.0 <https://aclanthology.org/W17-1704.pdf>, 1.1 <https://aclanthology.org/W18-4925.pdf>, 1.2 <https://aclanthology.org/2020.mwe-1.14.pdf>) were dedicated to the identification of MWEs, but focused on verbal MWEs only. This edition covers all syntactic types of MWEs. Additionally, it addresses paraphrasing of MWEs, which is closer to understanding than sheer identification. We propose two subtasks: the first corresponds to the classical identification task in running text; the second consists in paraphrasing a sentence containing a MWE so as to remove idiomaticity. Data annotation is ongoing and at least 19 languages are expected to be covered: Albanian, Brazilian Portuguese, Dutch, Egyptian (ca. 2700-2000 BC), French, Georgian, Greek (Ancient), Greek (Modern), Hebrew, Italian, Japanese, Latvian, Lithuanian, Persian, Polish, Romanian, Serbian, Swedish and Ukrainian.

AdMIRe 2.0 (Advancing Multimodal Idiomaticity Representation) is a shared task that addresses the challenge of idiomatic language understanding by evaluating how well computational models interpret potentially idiomatic expressions (PIEs) using both text and images. This new edition extends the first edition of the AdMIRe task <https://arxiv.org/pdf/2503.15358> by broadening language coverage to around 30 languages from the UniDive network and by introducing parallel idiomaticity datasets designed to assess the cross-lingual and multilingual capabilities of current language technologies. By evaluating how well images reflect idiomatic versus literal meanings, AdMIRe 2.0 establishes a richer benchmark for multimodal, multilingual idiomatic language comprehension. Given a context sentence containing a PIE and a set of five images, the task is to rank the images based on how accurately they depict the meaning of the PIE used in that sentence. The task will be zero-shot for newly introduced languages: only the English dataset will be provided to teams wishing to fine-tune their models or apply few-shot learning techniques.
This setup allows participants to test the generalization and cross-lingual capabilities of their systems while minimizing data preparation efforts. While the task is designed to encourage participation from teams working on multimodal technologies, it also accommodates approaches focused solely on text. To support broader participation and reduce the complexity and computational cost for such teams, the organizers will provide automatically generated descriptive captions for each image, allowing models to rely exclusively on text input if desired.

Subtasks
--------

PARSEME 2.0:

Subtask 1: MWE identification
The main goal of this subtask is to evaluate the systems' ability to identify MWEs. Systems must recognize the tokens that belong to MWEs in running text. Participants are strongly encouraged to pay attention to the diversity of their predictions: in addition to performance scores (F-score), the results will be evaluated in terms of the diversity of the MWEs correctly predicted by the systems.

Subtask 2: Paraphrasing MWEs
Given a sentence containing a MWE, the task is to generate a new sentence with the same meaning but without this MWE. As a continuation of Subtask 1, this task tests not only the systems' ability to identify MWEs, but also to capture their meaning. Here again, we encourage systems to produce diverse paraphrases, and diversity measures will be used in the evaluation.

AdMIRe 2.0: Static Image Ranking

Participants are presented with a context sentence containing a PIE and a set of five images. The objective is to rank these images based on how well they visually represent the idiomatic meaning within that specific context. For each idiom, five different images are generated in a consistent style. These images cover a range of idiomaticity: a synonym for the idiomatic meaning, a synonym for the literal meaning, something related to the idiomatic meaning but not synonymous, something related to the literal meaning but not synonymous, and a distractor image that is unrelated to both meanings.

Which of these images best represents the meaning of the phrase bad apple in the following sentence? "We have to recognize that this is not the occasional bad apple but a structural, sector-wide problem." And how about here? "However, if ethylene happens to be around (say from a bad apple), these fruits do ripen more quickly."

Provided data
-------------

PARSEME 2.0:

The PARSEME community <https://gitlab.com/parseme/corpora/-/wikis> has been involved in a long-standing effort of universalist modelling and annotation of MWEs across many languages. Corpora for 30 languages annotated for verbal MWEs have been published in the past. For Subtask 1 (identification), we are now in the process of annotating other syntactic types of MWEs (nominal, adjectival, etc.). The annotation follows the guidelines <https://parsemefr.lis-lab.fr/parseme-st-guidelines/2.0/>, unified across 30 languages. The training data will contain several hundred to several thousand annotated MWEs, depending on the language. The cupt format <https://gitlab.com/parseme/corpora/-/wikis/cupt-format>, an extension of the .conllu format, will be used: every sentence will thus carry MWE annotations on top of its morphosyntactic annotation.
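To make the format concrete, the sketch below shows how the extra PARSEME:MWE column of a .cupt sentence can be read: the first token of each MWE carries an annotation of the form id:CATEGORY, the remaining tokens of that MWE carry the bare id, and tokens outside any MWE carry an asterisk. The example sentence (the light verb construction mentioned above), its simplified morphosyntactic annotation and the category tag are illustrative only and are not taken from the shared task data.

    # A minimal sketch of reading the PARSEME:MWE column of the .cupt format.
    # The sentence, its (simplified) annotation and the "LVC.full" category tag
    # are illustrative; the official tools also handle details skipped here
    # (multiword-token ranges, empty nodes, whole files).
    from collections import defaultdict

    cupt_lines = [
        "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE",
        "# text = I will pay a visit to my aunt.",
        "1\tI\tI\tPRON\t_\t_\t3\tnsubj\t_\t_\t*",
        "2\twill\twill\tAUX\t_\t_\t3\taux\t_\t_\t*",
        "3\tpay\tpay\tVERB\t_\t_\t0\troot\t_\t_\t1:LVC.full",
        "4\ta\ta\tDET\t_\t_\t5\tdet\t_\t_\t*",
        "5\tvisit\tvisit\tNOUN\t_\t_\t3\tobj\t_\t_\t1",
        "6\tto\tto\tADP\t_\t_\t8\tcase\t_\t_\t*",
        "7\tmy\tmy\tPRON\t_\t_\t8\tnmod:poss\t_\t_\t*",
        "8\taunt\taunt\tNOUN\t_\t_\t3\tobl\t_\t_\t*",
        "9\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\t*",
    ]

    mwes = defaultdict(list)  # MWE id -> list of (token id, form) pairs
    for line in cupt_lines:
        if not line or line.startswith("#"):
            continue                      # skip comments and blank lines
        cols = line.split("\t")
        tok_id, form, mwe_col = cols[0], cols[1], cols[10]
        if mwe_col == "*":
            continue                      # token belongs to no MWE
        for code in mwe_col.split(";"):   # a token may belong to several MWEs
            mwe_id = code.split(":")[0]   # "1:LVC.full" -> "1", "1" -> "1"
            mwes[mwe_id].append((tok_id, form))

    print(dict(mwes))  # {'1': [('3', 'pay'), ('5', 'visit')]}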
For Subtask 2 (paraphrasing), we provide a small set of trial data but no training data. All test data will be handcrafted based on subsets of the test data for Subtask 1. Each test sentence will contain a single MWE, which will necessarily be a nominal, adjectival or verbal idiom.

AdMIRe 2.0:

The AdMIRe 2.0 dataset, developed for Advancing Multimodal Idiomaticity Representation, supports the study of potentially idiomatic expressions (PIEs) across textual and visual modalities in around 30 languages beyond English. The test data is meticulously created by language experts, with both textual and visual components crafted under strict guidelines. Image generation is carried out using Discord-based tools, primarily Midjourney and the free alternative AdMIRe Bot (Flux-Schnell), to produce consistent and diverse visual interpretations. Each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including a distractor, and by at least two context sentences: one in which the expression is used literally and one in which it is used idiomatically. These sentences were obtained from existing corpora or were written specifically for AdMIRe. To ensure the integrity and usability of the dataset, all data undergo rigorous human review and ethical clearance, resulting in a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding. The task will be conducted in a zero-shot setting for all languages except English: only the English dataset will be released for teams wishing to fine-tune their models or apply few-shot learning techniques. This setup enables participants to evaluate the generalization and cross-lingual transfer capabilities of their systems while minimizing data preparation effort.

Evaluation metrics
------------------

PARSEME 2.0 employs specific metrics for its two subtasks, measuring both performance and diversity:

- Subtask 1: MWE identification
  - Performance: Precision, Recall and F-measure in two variants, both as macro-averages and as per-language scores
    - MWE-based (all tokens of a MWE have to be perfectly identified)
    - Token-based (partial identification is also rewarded)
  - Diversity: diversity will be measured along two dimensions, variety (how many MWE types a system is able to identify) and balance (how evenly a system pays attention to the various MWE types), as well as entropy (a hybrid measure combining variety and balance).
- Subtask 2: Paraphrasing MWEs
  - Performance:
    - BERT-score will be used. This measure compares two sentences and gives a score for their similarity. The human experts provide two paraphrases per sentence: a "minimal" one (as similar as possible to the source sentence) and a "creative" one (as different as possible from the source). The final score will be the maximum BERT-score between the prediction and either the minimal or the creative gold standard (see the sketch after this list).
    - Manual evaluation: if enough human resources are available, a manual evaluation of the best-performing systems will be carried out in order to refine the results.
  - Diversity: for each language, all paraphrases will be evaluated jointly for variety, balance and entropy. These scores will be integrated into the leaderboard, to reward systems producing the most diverse outputs. These results will also enable a more in-depth analysis of the differences in diversity between humans and generative systems.
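As a concrete reading of the Subtask 2 performance score, the sketch below uses the bert-score Python package to score one system paraphrase against a minimal and a creative gold paraphrase and keeps the higher of the two BERT-scores. The sentences, variable names and model choice (lang="en") are invented for illustration; the official evaluation script may differ in these details.

    # A minimal sketch of the Subtask 2 performance score described above:
    # BERT-score of the prediction against both gold paraphrases, keeping the max.
    # All sentences below are invented examples, not shared task data.
    from bert_score import score

    predictions   = ["This child has a bad influence on the other kids."]
    minimal_refs  = ["This child is a person with a bad influence on others."]
    creative_refs = ["The other kids pick up bad habits from this one."]

    # F1 BERT-scores against each gold paraphrase (lang selects the underlying model).
    _, _, f_minimal  = score(predictions, minimal_refs,  lang="en")
    _, _, f_creative = score(predictions, creative_refs, lang="en")

    # Per-sentence final score = maximum of the two BERT-scores.
    final = [max(m, c) for m, c in zip(f_minimal.tolist(), f_creative.tolist())]
    print(final)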
AdMIRe 2.0 also employs specific evaluation for the Static Image Ranking task. Competition rankings are based on top image accuracy, with DCG used to break ties. Final rankings in AdMIRe 2.0 will be based on the average top image accuracy across all languages, with each language contributing equally to the final score.

- Top Image Accuracy: measures whether the most representative image is correctly identified.
- Normalized Discounted Cumulative Gain (NDCG): assesses the ranking quality, using a new weighting of [3, 1, 0, 0, 0] for the five image positions to avoid penalizing systems for permuting the order of low-relevance images.

Important dates
---------------

- [5 SEPTEMBER] Publication of trial data and baselines
- [1 OCTOBER] Training data released
- [15 DECEMBER] Publication of blind test data
- [19 DECEMBER] Submission of system predictions
- [10 JANUARY] Systems evaluated
- [JANUARY] Submission deadline for system description papers
- [24-29 MARCH 2026: EACL] MWE workshop (to be confirmed)

Organizing team
---------------

PARSEME 2.0:

- Manon Scholivet, Université Paris Saclay, LISN, FR
- Takuya Nakamura, Université Paris Saclay, LISN, FR
- Agata Savary, Université Paris Saclay, LISN, FR
- Éric Bilinski, Université Paris Saclay, LISN, FR
- Carlos Ramisch, Aix-Marseille Université, LIS, FR

AdMIRe 2.0:

- Adriana Pagano <https://scholar.google.com/citations?user=iMOX_EQAAAAJ&hl=en&oi=ao>, Universidade Federal de Minas Gerais, BR
- Aline Villavicencio <https://sites.google.com/view/alinev>, University of Exeter, UK
- Dilara Torunoğlu Selamet <https://scholar.google.com/citations?user=mkpbvoAAAAAJ&hl=en>, Istanbul Technical University, TR
- Doğukan Arslan <https://scholar.google.com/citations?user=8Lc2J1cAAAAJ&hl=en&oi=ao>, Istanbul Technical University, TR
- Gülşen Eryiğit <https://scholar.google.com/citations?user=25CpSdkAAAAJ&hl=en&oi=ao>, Istanbul Technical University, TR
- Rodrigo Wilkens <https://scholar.google.com/citations?user=-sIkqlEAAAAJ&hl=en>, University of Exeter, UK
- Wei He <https://scholar.google.com.hk/citations?user=3BWaQ4cAAAAJ&hl=zh-CN>, University of Exeter, UK