The ReproGen shared task will be running again this year!

You can try to reproduce some of the published results from the list below,
or some of your own results (whichever you prefer).

Reproductions of both human and automatic evaluations are welcome.

Deadline for submitting reports: 6 June 2022


Background

Across Natural Language Processing (NLP), a growing body of work is
exploring the issue of reproducibility in machine learning contexts. The
field is currently far from having a generally agreed toolbox of methods
for defining and assessing reproducibility. Reproducibility of the results
of human evaluation experiments is particularly under-addressed, which is
of concern for Natural Language Generation (NLG) in particular, where human
evaluation is common. More generally, human evaluations provide the
benchmarks against which automatic evaluation methods are assessed across
NLP, and are moreover widely regarded as the standard form of evaluation in
NLG.

Following the organisation of the first ReproGen shared task
<https://reprogen.github.io/2021/> on reproducibility of human evaluations
in NLG as part of Generation Challenges (GenChal) at INLG’21, we are
organising the second ReproGen shared task on reproducibility of
evaluations in NLG with a widened scope, now extending to both human and
automatic evaluation results. As with the first ReproGen shared task, our
aim is (i) to shed light on the extent to which past NLG evaluations have
been reproducible, and (ii) to draw conclusions regarding how NLG
evaluations can be designed and reported to increase reproducibility. If
the task is run over several years, we hope to be able to document an
overall increase in levels of reproducibility over time.

About ReproGen

We are once again organising ReproGen with two tracks, one an ‘unshared
task’ in which teams attempt to reproduce their own prior automatic or
human evaluation results (Track B below), the other a shared task in which
teams repeat existing automatic and human evaluation studies with the aim
of reproducing their results (Track A):

*A. Main Reproducibility Track*: For a shared set of selected evaluation
studies (see below), participants repeat one or more of the studies, and
attempt to reproduce their results, using published information plus
additional information and resources provided by the authors, and making
common-sense assumptions where information is still incomplete.

*B. RYO Track*: Reproduce Your Own previous automatic or human evaluation
results, and report what happened. Unshared task.

Track A Papers

We have selected the papers listed below for inclusion in ReproGen Track A,
four of which were also offered in Track A last year. The authors have
agreed for the evaluation studies from their papers identified below to be
used in reproduction studies. In all cases, the system outputs to be
evaluated and any reusable tools used in the evaluations are available. We
also have completed ReproGen Human Evaluation Sheets available, which we
will use as the standard for establishing similarity between different
human evaluation studies.

The papers and studies, with many thanks to the authors for supporting
ReproGen, are:

*van der Lee et al. (2017): PASS: A Dutch data-to-text system for soccer,
targeted towards specific audiences <https://aclanthology.org/W17-3513.pdf>
[1 evaluation study; Dutch; 20 evaluators; 1 quality criterion;
reproduction target: primary scores]*

*Dušek et al. (2018): Findings of the E2E NLG Challenge
<https://aclanthology.org/W18-6539.pdf> [1 evaluation study; English;
MTurk; 2 quality criteria; reproduction target: primary scores]*

*Qader et al. (2018): Generation of Company descriptions using
concept-to-text and text-to-text deep models: dataset collection and
systems evaluation <https://aclanthology.org/W18-6532.pdf> [1 evaluation
study; English; 19 evaluators; 4 quality criteria; reproduction target:
primary scores]*

*Santhanam & Shaikh (2019): Towards Best Experiment Design for Evaluating
Dialogue System Output <https://aclanthology.org/W19-8610.pdf> [3
evaluation studies differing in experimental design; English; 40
evaluators; 2 quality criteria; reproduction target: correlation scores
between 3 studies]*

*Nisioi et al. (2017): Exploring Neural Text Simplification Models
<https://aclanthology.org/P17-2014.pdf> [1 automatic evaluation study;
reproduction target: 2 automatic scores]; [1 human evaluation study; 70
sentences; 9 system outputs; 4 quality criteria; reproduction target:
primary scores]*

Track A and B Instructions

Step 1. Fill in the registration form <https://forms.gle/TFK9TWDetBYhwNov5>,
indicating which of the above papers, or which of your own papers, you wish
to carry out a reproduction study for.

Step 2. The ReproGen participant information will be made available to
you, together with data, tools and other materials for each of the studies
you selected in the registration form.

Step 3. Carry out the reproduction, and submit a report of up to 8 pages
plus references and supplementary material, including a completed ReproGen
Human Evaluation Data Sheet Light (HEDS-Light, a new, simplified version of
the original HEDS) for each reproduction study, by 6 June 2022.

Step 4. The organisers will carry out a light-touch review of the
evaluation reports according to the following criteria:

   - The HEDS-Light evaluation sheet has been completed.
   - Exact repetition of the study has been attempted and is described in
   the report.
   - Report gives full details of the reproduction study, in accordance
   with the reporting guidelines provided.
   - All tools and resources used in the study are publicly available.

Step 5. Present the paper at the results meeting.

Reports will be included in the INLG’22 proceedings and results will be
presented in the GenChal’22 session at INLG
<https://inlgmeeting.github.io/index.html>. Full details and instructions
will be provided in the ReproGen participant information.

Important Dates

   - 28 January 2022: First Call for Participation and registration opens
   - 06 June 2022: Submission deadline for reproduction reports
   - 12 June 2022: Reviews and feedback to authors
   - 20 June 2022: Camera-ready papers due
   - 18-22 July 2022: Results presented at INLG
   <https://inlgmeeting.github.io/index.html>

Organisers

Anya Belz, ADAPT/DCU, Ireland
Maja Popović, ADAPT/DCU, Ireland
Anastasia Shimorina, Orange, Lannion, France
Ehud Reiter, University of Aberdeen, UK

Contact

[email protected]
https://reprogen.github.io