CALL FOR PAPERS
Intrinsic and Extrinsic Evaluation Measures
for MT and/or Summarization
Workshop at the Annual Meeting of
the Association for Computational Linguistics (ACL 2005)
Ann Arbor, Michigan
June 29, 2005
http://www.isi.edu/~cyl/MTSE2005/
This one-day workshop will focus on the challenges that the MT
and summarization communities face in developing valid and useful
evaluation measures. Our aim is to bring these two communities
together to learn from each other's approaches.
In the past few years, both MT and summarization evaluation have
seen the introduction of n-gram-based intrinsic metrics that
automatically score system outputs against human-produced reference
documents (e.g., IBM's BLEU and ISI/USC's counterpart ROUGE).
Similarly, there has been renewed interest in user applications and
task-based extrinsic measures in both communities (e.g., DUC'05 and
TIDES'04). Most recently, evaluation efforts have tested for
correlations to cross-validate independently derived intrinsic and
extrinsic assessments of system outputs against each other and
against human judgments of output quality, such as accuracy and
fluency.
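(For illustration only, the short Python sketch below computes a
clipped n-gram overlap between a candidate text and a set of
references, yielding a BLEU-style precision and a ROUGE-style recall
for a single n. It is a simplified sketch of the general idea behind
such metrics, not the official BLEU or ROUGE scorer, and the function
names are our own.)

from collections import Counter

def ngrams(tokens, n):
    # Count all n-grams (as tuples) in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_scores(candidate, references, n=1):
    # Clipped n-gram precision (BLEU-style) and recall (ROUGE-N-style)
    # of one candidate against one or more reference texts.
    cand = ngrams(candidate.split(), n)
    max_ref = Counter()   # max count of each n-gram over the references
    total_ref = 0         # total n-gram tokens across all references
    for ref in references:
        counts = ngrams(ref.split(), n)
        total_ref += sum(counts.values())
        for gram, c in counts.items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand.items())
    precision = clipped / max(sum(cand.values()), 1)
    recall = clipped / max(total_ref, 1)
    return precision, recall

# Example: unigram overlap of a candidate against a single reference.
p, r = overlap_scores("the cat sat on the mat", ["the cat is on the mat"], n=1)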
The concrete questions that we hope to see addressed in this
workshop include, but are not limited to:
- How adequately do intrinsic measures capture the variation
between system outputs and human-generated reference documents
(summaries or translations)? What methods exist for calibrating
and controlling the variation in linguistic complexity and content
differences in input test-sets and reference sets?
How much variation exists within these constructed sets?
How does that variation affect different intrinsic measures?
How many reference documents are needed for effective scoring?
- How can intrinsic measures go beyond simple n-gram matching to
quantify the similarity between system outputs and human references?
What other features and weighting alternatives lead to better
metrics for both MT and summarization? How can intrinsic measures
capture fluency and adequacy? Which types of new intrinsic metrics
are needed to adequately evaluate non-extractive summaries and
paraphrasing (e.g., interlingual) translations?
- How effectively do extrinsic (or proxy extrinsic) measures capture the
quality of system output, as needed for downstream use in human tasks,
such as triage (document relevance judgments), extraction (factual
question answering), and report writing; and in automated tasks,
such as filtering, information extraction, and question-answering?
For example, when is an MT system good enough that a summarization
system benefits from the additional information available in
the MT output?
- How should metrics for MT and summarization be assessed and
compared? What characteristics should a good metric possess?
When is one evaluation method better than another? What are the
most effective ways of assessing the correlation testing and
statistical modeling that seek to predict human task performance
or human notions of output quality (e.g., fluency and adequacy)
from "cheaper" automatic metrics? How reliable are human judgments?
Anyone with an interest in MT or summarization evaluation research or
in issues pertaining to the combination of MT and summarization is
encouraged to participate in the workshop. We are looking for research
papers on the aforementioned topics, as well as position papers that
identify limitations in current approaches and describe promising
future research directions.
SHARED DATA SETS
To facilitate the comparison of different measures during the
workshop, we will be making available data sets in advance for
workshop participants to test their approaches to evaluation.
For details on accessing the data sets, please go to the workshop's
website at http://www.isi.edu/~cyl/MTSE2005.
WORKSHOP FORMAT
The workshop will include presentations of research papers
and short reports, an invited report on the TIDES 2005 multilingual,
multi-document summarization evaluation, and significant discussion
time to compare results of different researchers. The workshop
will conclude with a panel of invited discussants to address future
research directions.
TARGET AUDIENCE
The topic of this workshop should be of significant interest to
the entire MT and summarization research communities, as well as to
commercial developers of MT and summarization systems. It should be
of particular interest to the program managers and participants of the
MT and Summarization programs funded by the US Government, where
common evaluations are an integral part of the research program.
SUBMISSION INFORMATION
Submissions will consist of regular full papers, reports on evaluations
using shared data sets, and position papers, formatted following the
ACL 2005 guidelines. Details for submission will be posted on the
workshop website. The submission and review processes will be
handled electronically.
IMPORTANT DATES
All submissions due: Mon, May 2, 2005
Notification: Sun, May 22, 2005
Camera-ready papers due: Wed, June 1, 2005
ORGANIZERS
Jade Goldstein, US Department of Defense, USA
Alon Lavie, Language Technologies Institute, CMU, USA
Chin-Yew Lin, Information Sciences Institute, USC, USA
Clare Voss, Army Research Laboratory, USA
PROGRAM COMMITTEE
Yasuhiro Akiba (ATR, Japan)
Leslie Barrett (TransClick, USA)
Bonnie Dorr (U Maryland, USA)
Tony Hartley (U Leeds, UK)
John Henderson (MITRE, USA)
Chiori Hori (LTI CMU, USA)
Eduard Hovy (ISI/USC, USA)
Doug Jones (MIT Lincoln Laboratory, USA)
Philipp Koehn (CSAIL MIT, USA)
Marie-Francine Moens (Katholieke Universiteit Leuven, Belgium)
Hermann Ney (RWTH Aachen, Germany)
Franz Och (Google, USA)
Becky Passonneau (Columbia U, NY USA)
Andrei Popescu-Belis (ISSCO/TIM/ETI, U Geneva, Switzerland)
Dragomir Radev (U Michigan, USA)
Karen Sparck Jones (Computer Laboratory, Cambridge U, UK)
Simone Teufel (Computer Laboratory, Cambridge U, UK)
Nicola Ueffing (RWTH Aachen, Germany)
Hans van Halteren (U Nijmegen, The Netherlands)
Michelle Vanni (ARL, USA)
Dekai Wu (HKUST, Hong Kong)