We invite the community to participate in a shared task organized in the
context of the CONDA workshop: https://conda-workshop.github.io/.
Data contamination, where evaluation data is inadvertently included in
the pre-training corpora of large-scale models, and of language models
(LMs) in particular, has become a growing concern (Sainz et al., 2023
<https://aclanthology.org/2023.findings-emnlp.722/>; Jacovi et al. 2023
<https://aclanthology.org/2023.emnlp-main.308/>). The growing scale of
both models and data, coupled with massive web crawling, has led to the
inclusion of segments from evaluation benchmarks in the pre-training
data of LMs (Dodge et al., 2021
<https://aclanthology.org/2021.emnlp-main.98/>; OpenAI, 2023
<https://arxiv.org/abs/2303.08774>; Google, 2023
<https://arxiv.org/abs/2305.10403>; Elazar et al., 2023
<https://arxiv.org/abs/2310.20707>). The scale of internet data makes it
difficult to prevent this contamination from happening, or even detect
when it has happened (Bommasani et al., 2022
<https://arxiv.org/abs/2108.07258>; Mitchell et al., 2023
<https://arxiv.org/abs/2212.05129>). Crucially, when evaluation data
becomes part of pre-training data, it introduces biases and can
artificially inflate the performance of LMs on specific tasks or
benchmarks (Magar and Schwartz, 2022
<https://aclanthology.org/2022.acl-short.18/>). This poses a challenge
for fair and unbiased evaluation of models, as their performance may not
accurately reflect their generalization capabilities.
The shared task is a community effort on centralized data contamination
evidence collection. While the problem of data contamination is
prevalent and serious, the breadth and depth of this contamination are
still largely unknown. Concrete evidence of contamination is
scattered across papers, blog posts, and social media, and it is
suspected that the true scope of data contamination in NLP is
significantly larger than reported.
With this shared task we aim to provide a structured, centralized
platform for contamination evidence collection to help the community
understand the extent of the problem and to help researchers avoid
repeating the same mistakes. The shared task also gathers evidence of
clean, non-contaminated instances. The platform is already available for
perusal at
https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report
Participants in the shared task need to submit their contamination
evidence (see instructions below). The CONDA 2024 workshop organizers
will review the evidence through pull requests.
*/Compilation Paper/*
As a companion to the contamination evidence platform, we will produce a
paper that will provide a summary and overview of the evidence collected
in the shared task. The participants who contribute to the shared task
will be listed as co-authors of the paper.
*/Instructions for Evidence Submission/*
Each submission should report a case of contamination or the lack
thereof. The submission can be either about (1)
contamination in the corpus used to pre-train language models, where the
pre-training corpus contains a specific evaluation dataset, or about (2)
contamination in a model that shows evidence of having seen a specific
evaluation dataset while being trained. Each submission needs to mention
the corpus (or model) and the evaluation dataset, in addition to some
evidence of contamination. Alternatively, we also welcome evidence of a
lack of contamination.
Reports must be submitted through a Pull Request in the Data
Contamination Report space at HuggingFace. The reports must follow the
Contribution Guidelines provided in the space and will be reviewed by
the organizers. If you have any questions, please contact us at
[email protected]
<mailto:[email protected]> or open a discussion in the
space itself.
URL with contribution guidelines:
https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report
(“Contribution Guidelines” tab)
*/Important dates/*
* Deadline for evidence submission: July 1, 2024
* Workshop day: August 16, 2024
*/Sponsors/*
* AWS AI and Amazon Bedrock
* HuggingFace
* Google
*/Contact/*
* Website: https://conda-workshop.github.io/
* Email: [email protected]
<mailto:[email protected]>
*/Organizers/*
Oscar Sainz, University of the Basque Country (UPV/EHU)
Iker García-Ferrero, University of the Basque Country (UPV/EHU)
Eneko Agirre, University of the Basque Country (UPV/EHU)
Jon Ander Campos, Cohere
Alon Jacovi, Bar Ilan University
Yanai Elazar, Allen Institute for Artificial Intelligence and University
of Washington
Yoav Goldberg, Bar Ilan University and Allen Institute for Artificial
Intelligence
_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]