Dear colleagues,

you are invited to participate in the Eval4NLP 2023 shared task on **Prompting 
Large Language Models as Explainable Metrics**. 

Please find more information below and on the shared task webpage: 
https://eval4nlp.github.io/2023/shared-task.html

Important Dates

- Shared task announcement: August 02, 2023
- Dev phase: August 07, 2023
- Test phase: September 18, 2023
- System Submission Deadline: September 23, 2023
- System paper submission deadline: October 5, 2023
- System paper camera ready submission deadline: October 12, 2023

All deadlines are 11:59 pm UTC-12 (“Anywhere on Earth”). The timeframe of the 
test phase may change. Please regularly check the shared task webpage:  
https://eval4nlp.github.io/2023/shared-task.html.

** Overview **

With groundbreaking innovations in unsupervised learning and scalable 
architectures, the opportunities (but also risks) of automatically generating 
audio, images, video, and text seem overwhelming. Human evaluations of this 
content are costly and often infeasible to collect. Thus, the need for 
automatic metrics that reliably judge the quality of generation systems and 
their outputs is stronger than ever. Current state-of-the-art metrics for 
natural language generation (NLG) still do not match the performance of human 
experts. They are mostly based on black-box language models and usually return 
a single sentence-level quality score, making it difficult to explain their 
internal decision process and their outputs.

The release of APIs to large language models (LLMs) like ChatGPT, and the 
recent open-source availability of LLMs like LLaMA, have led to a boost of 
research in NLP, including on LLM-based metrics. Metrics like GEMBA [*] explore 
prompting ChatGPT and GPT-4 to leverage them directly as metrics. 
InstructScore [*] goes in a different direction and fine-tunes a LLaMA model to 
predict a fine-grained error diagnosis of machine-translated content. We notice 
that current work (1) does not systematically evaluate the vast space of 
possible prompts and prompting techniques for metric usage, including, for 
example, approaches that explain a task to a model or let the model explain a 
task itself, and (2) rarely evaluates the performance of recent open-source 
LLMs, even though their usage is incredibly important for improving the 
reproducibility of metric research compared to closed-source metrics.

This year’s Eval4NLP shared task combines these two aspects. We provide a 
selection of open-source, pre-trained LLMs. The task is to develop strategies 
to extract scores from these LLMs that grade machine translations and 
summaries. We specifically focus on prompting techniques; therefore, 
fine-tuning of the LLMs is not allowed.

Based on the submissions, we hope to explore and formalize prompting approaches 
for open-source LLM-based metrics and, with that, help improve their 
correlation to human judgements. As many prompting techniques produce 
explanations as a side product, we hope that this task will also lead to more 
explainable metrics. Further, we want to evaluate which of the selected 
open-source models provide the best capabilities as metrics and, thus, as a 
basis for fine-tuning.

** Goals **

The shared task has the following goals:

Prompting strategies for LLM-based metrics: We want to explore which prompting 
strategies perform best for LLM-based metrics, e.g., few-shot prompting [*], 
where examples of other solutions are given in the prompt; chain-of-thought 
(CoT) reasoning [*], where the model is prompted to provide a multi-step 
explanation itself; or tree-of-thought prompting [*], where different 
explanation paths are considered and the best is chosen. Automatic prompt 
generation might also be considered [*]. Numerous other recent works explore 
further prompting strategies, some of which use multiple evaluation passes.
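To make the first two strategies concrete, here is a minimal sketch of prompt
construction for a translation metric. All prompt wording, the 0-100 score
range, and the example pairs are our own illustrative assumptions, not official
shared-task prompts:

```python
# Sketch: two prompting strategies for an LLM-based MT metric.
# The wording, score range, and examples below are illustrative only.

FEW_SHOT_EXAMPLES = [
    ("Der Hund schläft.", "The dog is sleeping.", 95),
    ("Ich mag Katzen.", "I like the cats very much.", 60),
]

def few_shot_prompt(source: str, translation: str) -> str:
    """Few-shot: show scored example pairs, then ask for a score
    for the new source/translation pair."""
    lines = ["Rate each translation from 0 (worst) to 100 (best)."]
    for src, hyp, score in FEW_SHOT_EXAMPLES:
        lines.append(f"Source: {src}\nTranslation: {hyp}\nScore: {score}")
    # The prompt ends right before the score, so the LLM completes it.
    lines.append(f"Source: {source}\nTranslation: {translation}\nScore:")
    return "\n\n".join(lines)

def chain_of_thought_prompt(source: str, translation: str) -> str:
    """CoT: ask the model to explain its judgement step by step
    before committing to a final score."""
    return (
        "Rate the translation from 0 (worst) to 100 (best).\n"
        f"Source: {source}\nTranslation: {translation}\n"
        "First list any accuracy and fluency errors step by step, "
        "then end with a line 'Score: <number>'."
    )
```

The few-shot prompt ends just before the score so that the model's completion
is the score itself; the CoT prompt instead elicits an explanation first, which
can double as the metric's explanation output.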

Score aggregation for LLM-based metrics: We also want to explore which 
strategies best aggregate model scores from LLM-based metrics. E.g., scores 
might be extracted as the probability of a paraphrase being created [*], or 
they could be extracted from the LLM output directly [*].

Explainability for LLM-based metrics: We want to analyze whether the metrics 
that provide the best explanations (for example, with CoT) also achieve the 
highest correlation to human judgements. We assume that this is the case, as 
the human judgements are themselves based on fine-grained evaluations 
(e.g., MQM for machine translation).
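Correlation to human judgements is typically measured with rank-correlation
statistics. As a rough illustration (the shared task's official evaluation
measure may differ), here is a minimal pure-Python sketch of Kendall's tau-a
between metric scores and human scores:

```python
from itertools import combinations

def kendall_tau(metric_scores: list[float], human_scores: list[float]) -> float:
    """Kendall's tau-a: agreement between the metric's ranking of segments
    and the human ranking. +1 = identical order, -1 = fully reversed."""
    concordant = discordant = 0
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        prod = (m1 - m2) * (h1 - h2)
        if prod > 0:      # both rankings order the pair the same way
            concordant += 1
        elif prod < 0:    # the rankings disagree on this pair
            discordant += 1
    n_pairs = len(metric_scores) * (len(metric_scores) - 1) / 2
    return (concordant - discordant) / n_pairs
```

In practice one would use a library implementation (e.g. scipy.stats.kendalltau,
which also handles ties); the sketch is only meant to show what is being measured.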

** Task Description **

The task will consist of building a reference-free metric for machine 
translation and/or summarization that predicts sentence-level quality scores 
constructed from fine-grained scores or error labels. Reference-free means that 
the metric rates the provided machine translation solely based on the provided 
source sentence/paragraph, without any additional human-written references. 
Further, we note that many open-source LLMs have mostly been trained on English 
data, which adds further challenges to the reference-free setup.

To summarize, the task will be structured as follows:

- We provide a list of allowed LLMs from Huggingface
- Participants should use prompting to leverage these LLMs as metrics for MT 
and summarization
- Fine-tuning of the selected model(s) is not allowed
- We will release baselines, which participants might build upon
- We will provide a CodaLab dashboard to compare participants' solutions with 
each other

We plan to release the CodaLab submission environment, together with baselines 
and dev-set evaluation code, incrementally until August 7.

We will allow specific models from Huggingface; please refer to the webpage for 
more details: https://eval4nlp.github.io/2023/shared-task.html

Best wishes,

The Eval4NLP organizers 

[*] References are listed on the shared task webpage: 
https://eval4nlp.github.io/2023/shared-task.html
_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
