Hi everyone,
If you’ve been around a while, you are probably aware of how hard it can be to
find and cite old MT papers. Many of these can only be found on the MT Archive,
which has not been maintained for some years.
ACL Anthology: https://www.aclweb.org/anthology/
MT Archive: http://www.mt-archive.info
As Director of the ACL Anthology, I am looking for someone to help move the MT
Archive into the ACL Anthology. This conversion is a paid position (with
funding from IAMT) with a goal completion date of April 15, 2020, so that the
results can be demonstrated at EAMT.
I am personally very excited about this conversion project. We’ve put a lot of
work into the Anthology over the past year, and all of this could come together
very quickly. It is satisfying to watch the ingestions and changes go live, and
putting this wealth of data in a place where it can be easily searched,
exported, and cited will be immensely rewarding!
If you are interested, please contact me. You can see more information in the
job advertisement below.
# Seeking assistance to help in the conversion of the Machine Translation Archive
February 6, 2020
The Association for Computational Linguistics (ACL) is seeking assistance in
the task of ingesting the Machine Translation Archive (www.mt-archive.info)
into the ACL Anthology (www.aclweb.org/anthology). This job is funded by the
International Association for Machine Translation (IAMT) with the goal of
preserving and disseminating the wealth of information present in the Archive,
much of which is available only there.
## Job Description
The Machine Translation Archive (hereafter, “Archive”) was created by John
Hutchins in 2004 and currently contains about 12,000 entries. All of the
archive, including various portals and indexes, is hand-crafted HTML written
using Microsoft Word, and all of the papers are stored as PDF files. It is the
single most important source of papers about machine translation, with emphasis
on historical MT papers.
The main task is to convert the information in the MT Archive into the XML
format used by the Anthology. The steps, which will be done in close
collaboration with the Anthology Director, are:
• Producing a spreadsheet of conference proceedings and journals in the
  MT Archive, and obtaining identifiers for each of them from the Anthology team.
• Semi-automatically transforming each of these proceedings into the
  XML metadata format used by the Anthology. This will only include abstracts
  when they have already been extracted from the PDFs in the Archive.
• Renaming all the PDFs into the format required by the Anthology.
• Where not already extant, incorporating the conference program into
  the frontmatter (for example, for AMTA 2008).
• (Time-permitting) Converting the following additional
  manually-curated metadata from the Archive into a structured object that refers
  to the new Anthology identifiers:
    • Languages and language pairs
    • System and project names
    • Organizations and Affiliations
    • Methods, techniques, applications, and uses
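As a rough illustration of the renaming step, here is a minimal Python sketch. It assumes the identifier pattern visible in the appendix example (collection ID, a hyphen, one volume digit, then a three-digit paper number, as in P19-1001), and it assumes that each PDF is simply named after its identifier. The function names, the sample file names, and the collection ID "X08" are hypothetical; the real identifiers would come from the Anthology team.

```python
def anthology_id(collection: str, volume: int, paper: int) -> str:
    """Build an ID such as 'P19-1001' from collection 'P19', volume 1, paper 1."""
    # Assumption: one volume digit and a three-digit paper number, as in the
    # P19 example. Other layouts are not covered by this sketch.
    if not (1 <= volume <= 9 and 0 <= paper <= 999):
        raise ValueError("this sketch only covers volumes 1-9 and papers 0-999")
    return f"{collection}-{volume}{paper:03d}"


def plan_renames(pdf_names, collection, volume):
    """Pair each Archive PDF name with a proposed Anthology file name.

    Papers are numbered from 1 in sorted order of the original names;
    number 0 is left free for the frontmatter, as in P19-1000.
    """
    plan = []
    for number, old_name in enumerate(sorted(pdf_names), start=1):
        new_name = anthology_id(collection, volume, number) + ".pdf"
        plan.append((old_name, new_name))
    return plan


if __name__ == "__main__":
    # Hypothetical Archive file names and a hypothetical collection ID.
    archive_files = ["AMTA-2008-Smith.pdf", "AMTA-2008-Jones.pdf"]
    for old, new in plan_renames(archive_files, "X08", 1):
        print(f"{old} -> {new}")
```

Producing a rename plan first, rather than renaming files directly, leaves room for a manual check before anything on disk changes. Sorting by file name is only a stand-in; the order in the original proceedings would be the better choice where it is known.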
We hope to complete the conversion by April 15, 2020. Hourly salary will be
negotiated at time of hiring. Timesheets will be signed and approved by the
Anthology Director and paid biweekly by the ACL.
To apply, please send an email to [email protected], with a subject of
“Application for the MT Archive Ingestion Position”. In the body of the email,
please provide the following information:
• Personal Information: A curriculum vitae.
• Job Times: When you are able to start working; hours available per
week; estimated completion date.
• Qualifications: A paragraph describing your qualifications; an email
address for one or two references.
• Plan: A paragraph or two summarizing your intended technical approach.
## Appendix: Detailed Information
### Main XML format
The Anthology repository is open-sourced and is hosted online at
https://github.com/acl-org/acl-anthology. The paper metadata for the Anthology
is hosted in the data/xml directory, with XML files roughly corresponding to
events. For example, the proceedings of ACL 2019 are in data/xml/P19.xml, and
look like this:
<?xml version='1.0' encoding='UTF-8'?>
<collection id="P19">
  <volume id="1" ingest-date="2019-07-28">
    <meta>
      <booktitle>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</booktitle>
      <url>P19-1</url>
      <editor><first>Anna</first><last>Korhonen</last></editor>
      <editor><first>David</first><last>Traum</last></editor>
      <editor><first>Lluís</first><last>Màrquez</last></editor>
      <publisher>Association for Computational Linguistics</publisher>
      <address>Florence, Italy</address>
      <month>July</month>
      <year>2019</year>
    </meta>
    <frontmatter>
      <url>P19-1000</url>
    </frontmatter>
    <paper id="1">
      <title>One Time of Interaction May Not Be Enough: Go Deep with an Interaction-over-Interaction Network for Response Selection in Dialogues</title>
      <author><first>Chongyang</first><last>Tao</last></author>
      <author><first>Wei</first><last>Wu</last></author>
      <author><first>Can</first><last>Xu</last></author>
      <author><first>Wenpeng</first><last>Hu</last></author>
      <author><first>Dongyan</first><last>Zhao</last></author>
      <author><first>Rui</first><last>Yan</last></author>
      <pages>1–11</pages>
      <abstract>Currently, researchers have paid great attention to retrieval-based dialogues in open-domain. In particular, people study the problem by investigating context-response matching for multi-turn response selection based on publicly recognized benchmark data sets. State-of-the-art methods require a response to interact with each utterance in a context from the beginning, but the interaction is performed in a shallow way. In this work, we let utterance-response interaction go deep by proposing an interaction-over-interaction network (IoI). The model performs matching by stacking multiple interaction blocks in which residual information from one time of interaction initiates the interaction process again. Thus, matching information within an utterance-response pair is extracted from the interaction of the pair in an iterative fashion, and the information flows along the chain of the blocks via representations. Evaluation results on three benchmark data sets indicate that IoI can significantly outperform state-of-the-art methods in terms of various matching metrics. Through further analysis, we also unveil how the depth of interaction affects the performance of IoI.</abstract>
      <url>P19-1001</url>
      <doi>10.18653/v1/P19-1001</doi>
    </paper>
  </volume>
</collection>
Events are typically assigned a collection identifier, e.g., “P19”. Within a
collection are volumes (e.g., “1” for main papers, “2” for a demo session, and
so on). Finally, individual papers within a volume are numbered, starting at
“1”. For each volume within a collection, there is a top-level <meta> section
containing volume information, followed by an XML entry for each paper.
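To make the collection, volume, and paper hierarchy concrete, the following Python sketch assembles a small volume with the standard library's xml.etree.ElementTree. It is only an illustration: the function name build_volume, the shape of the input records, and the sample data are assumptions, only a subset of the <meta> fields is written, and the <url> values follow the pattern inferred from the P19 example rather than any documented rule.

```python
import xml.etree.ElementTree as ET


def build_volume(collection_id, volume_id, booktitle, year, papers):
    """Return a <collection> element holding one volume and its papers.

    `papers` is a list of dicts with 'title', 'authors' (a list of
    (first, last) tuples) and, optionally, 'pages'.
    """
    collection = ET.Element("collection", id=collection_id)
    volume = ET.SubElement(collection, "volume", id=str(volume_id))

    # Volume-level information goes into the top-level <meta> section.
    meta = ET.SubElement(volume, "meta")
    ET.SubElement(meta, "booktitle").text = booktitle
    ET.SubElement(meta, "url").text = f"{collection_id}-{volume_id}"
    ET.SubElement(meta, "year").text = str(year)

    # Papers are numbered from 1 within the volume.
    for number, record in enumerate(papers, start=1):
        paper = ET.SubElement(volume, "paper", id=str(number))
        ET.SubElement(paper, "title").text = record["title"]
        for first, last in record["authors"]:
            author = ET.SubElement(paper, "author")
            ET.SubElement(author, "first").text = first
            ET.SubElement(author, "last").text = last
        if record.get("pages"):
            ET.SubElement(paper, "pages").text = record["pages"]
        # Assumed pattern: one volume digit plus a three-digit paper number.
        ET.SubElement(paper, "url").text = f"{collection_id}-{volume_id}{number:03d}"
    return collection


if __name__ == "__main__":
    # Hypothetical collection ID and paper record, for illustration only.
    sample = [{"title": "An Example Paper", "authors": [("Ann", "Lee")], "pages": "1–5"}]
    root = build_volume("X08", 1, "Hypothetical Proceedings", 2008, sample)
    print(ET.tostring(root, encoding="unicode"))
```

A real conversion would also need editors, publisher, address, month, and a <frontmatter> entry, and should be checked against whatever validation the Anthology repository provides.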
### Metadata formats
The Archive contains additional information beyond proceedings volumes, which
we would also like converted. This includes the four categories listed above:
Languages and language pairs, System and project names, Organizations and
Affiliations, and Methods, techniques, applications, and uses. Each of these
sections should be converted into a validating YAML format. For example: “Index
of languages: A–D: publications since 2010” should be converted to look
something like the following:
afr-dut:
- [paperid]
- [paperid]
This lists all paper IDs that were originally recorded for that language pair;
the other metadata should be handled in the same way.
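A minimal Python sketch of this grouping step follows. It assumes the index has already been scraped into (key, paper ID) pairs, which is the harder part and is not shown. The function names and the sample IDs are hypothetical, and the YAML is written by hand without a YAML library, which is only safe while keys and IDs are plain strings with no special characters.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Group (key, paper_id) pairs into {key: [paper_id, ...]}.

    Order of first appearance is kept and duplicate IDs are dropped.
    """
    index = defaultdict(list)
    for key, paper_id in pairs:
        if paper_id not in index[key]:
            index[key].append(paper_id)
    return dict(index)


def to_yaml(index):
    """Render the mapping as a YAML mapping of lists, keys sorted."""
    lines = []
    for key in sorted(index):
        lines.append(f"{key}:")
        for paper_id in index[key]:
            lines.append(f"  - {paper_id}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical language-pair entries pointing at hypothetical IDs.
    entries = [
        ("afr-dut", "X08-1002"),
        ("afr-dut", "X08-1001"),
        ("ara-eng", "X08-1001"),
    ]
    print(to_yaml(group_by_key(entries)), end="")
```

The same two functions would work unchanged for system names, organizations, and methods, provided any key that needs YAML quoting is handled first.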