Hi everyone,

If you’ve been around a while, you are probably aware of how hard it can be to 
find and cite old MT papers. Many of these can only be found on the MT Archive, 
which has not been maintained for some years.

        https://www.aclweb.org/anthology/
        http://www.mt-archive.info

As Director of the ACL Anthology, I am looking for someone to help move the MT 
Archive into the ACL Anthology. This is a paid position (with funding from 
IAMT) with a target completion date of April 15, 2020, so that the results can 
be demonstrated at EAMT.

I am personally very excited about this conversion project. We’ve put a lot of 
work into the Anthology over the past year, and all of this could come together 
very quickly. It is satisfying to watch the ingestions and changes go live, and 
putting this wealth of data in a place where it can be easily searched, 
exported, and cited will be immensely rewarding!

If you are interested, please contact me. You can see more information in the 
job advertisement below.

# Seeking assistance to help in the conversion of the Machine Translation Archive

February 6, 2020

The Association for Computational Linguistics (ACL) is seeking assistance in 
the task of ingesting the Machine Translation Archive (www.mt-archive.info) 
into the ACL Anthology (www.aclweb.org/anthology). This job is funded by the 
International Association for Machine Translation (IAMT) with the goal of 
preserving and disseminating the wealth of information present in the Archive, 
much of which is available nowhere else.

## Job Description

The Machine Translation Archive (hereafter, “Archive”) was created by John 
Hutchins in 2004 and currently contains about 12,000 entries. All of the 
archive, including various portals and indexes, is hand-crafted HTML written 
using Microsoft Word, and all of the papers are stored as PDF files. It is the 
single most important source of papers about machine translation, with emphasis 
on historical MT papers.

The main task is to convert the information in the MT Archive into the XML 
format used by the Anthology. The steps, which will be done in close 
collaboration with the Anthology Director, are:

        • Producing a spreadsheet of conference proceedings and journals in the 
MT Archive, and obtaining identifiers for each of them from the Anthology team.
        • Semi-automatically transforming each of these proceedings into the 
XML metadata format used by the Anthology. This will only include abstracts 
when they have already been extracted from the PDFs in the Archive.
        • Renaming all the PDFs into the format required by the Anthology.
        • Where not already extant, incorporating the conference program into 
the frontmatter (for example, for AMTA 2008).
        • (Time-permitting) Converting the following additional 
manually-curated metadata from the Archive into a structured object that refers 
to the new Anthology identifiers:
                • Languages and language pairs
                • System and project names
                • Organizations and Affiliations
                • Methods, techniques, applications, and uses

We hope to complete the conversion by April 15, 2020. The hourly wage will be 
negotiated at the time of hiring. Timesheets will be signed and approved by the 
Anthology Director, and payment will be made biweekly by the ACL.

To apply, please send an email to [email protected], with a subject of 
“Application for the MT Archive Ingestion Position”. In the body of the email, 
please provide the following information:

        • Personal Information: A curriculum vitae.
        • Job Times: When you are able to start working; hours available per 
week; estimated completion date.
        • Qualifications: A paragraph describing your qualifications; an email 
address for one or two references.
        • Plan: A paragraph or two summarizing your intended technical approach.

## Appendix: Detailed Information

### Main XML format

The Anthology repository is open-sourced and is hosted online at 
https://github.com/acl-org/acl-anthology. The paper metadata for the Anthology 
is hosted in the data/xml directory, with XML files roughly corresponding to 
events. For example, the proceedings of ACL 2019 are in data/xml/P19.xml, and 
look like this:

<?xml version='1.0' encoding='UTF-8'?>
<collection id="P19">
 <volume id="1" ingest-date="2019-07-28">
   <meta>
     <booktitle>Proceedings of the 57th Annual Meeting of the Association for 
Computational Linguistics</booktitle>
     <url>P19-1</url>
     <editor><first>Anna</first><last>Korhonen</last></editor>
     <editor><first>David</first><last>Traum</last></editor>
     <editor><first>Lluís</first><last>Màrquez</last></editor>
     <publisher>Association for Computational Linguistics</publisher>
     <address>Florence, Italy</address>
     <month>July</month>
     <year>2019</year>
   </meta>
   <frontmatter>
     <url>P19-1000</url>
   </frontmatter>
   <paper id="1">
     <title>One Time of Interaction May Not Be Enough: Go Deep with an 
Interaction-over-Interaction Network for Response Selection in Dialogues</title>
     <author><first>Chongyang</first><last>Tao</last></author>
     <author><first>Wei</first><last>Wu</last></author>
     <author><first>Can</first><last>Xu</last></author>
     <author><first>Wenpeng</first><last>Hu</last></author>
     <author><first>Dongyan</first><last>Zhao</last></author>
     <author><first>Rui</first><last>Yan</last></author>
     <pages>1–11</pages>
     <abstract>Currently, researchers have paid great attention to 
retrieval-based dialogues in open-domain. In particular, people study the 
problem by investigating context-response matching for multi-turn response 
selection based on publicly recognized benchmark data sets. State-of-the-art 
methods require a response to interact with each utterance in a context from 
the beginning, but the interaction is performed in a shallow way. In this work, 
we let utterance-response interaction go deep by proposing an 
interaction-over-interaction network (IoI). The model performs matching by 
stacking multiple interaction blocks in which residual information from one 
time of interaction initiates the interaction process again. Thus, matching 
information within an utterance-response pair is extracted from the interaction 
of the pair in an iterative fashion, and the information flows along the chain 
of the blocks via representations. Evaluation results on three benchmark data 
sets indicate that IoI can significantly outperform state-of-the-art methods in 
terms of various matching metrics. Through further analysis, we also unveil how 
the depth of interaction affects the performance of IoI.</abstract>
     <url>P19-1001</url>
     <doi>10.18653/v1/P19-1001</doi>
   </paper>
 </volume>
</collection>

Events are typically assigned a collection identifier, e.g., “P19”. Within a 
collection are volumes (e.g., “1” for main papers, “2” for a demo session, and 
so on). Finally, individual papers within a volume are numbered, starting at 
“1”. For each volume within a collection, there is a top-level <meta> section 
containing volume information, followed by an XML entry for each paper.
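
As a rough illustration of the structure described above, the following Python 
sketch builds a skeleton volume with the standard library's 
xml.etree.ElementTree. It is not an official Anthology tool. The function 
names and the dictionary keys ("title", "authors") are hypothetical, and the 
URL layout (collection ID, hyphen, volume ID, three-digit paper number, with 
000 for the frontmatter) is inferred only from the P19 example; other 
collections may be numbered differently.

```python
import xml.etree.ElementTree as ET


def anthology_url(collection_id: str, volume_id: str, paper_id: str) -> str:
    """Build an identifier such as "P19-1001" from its three parts.

    Assumption: the volume ID is followed by a three-digit paper number,
    as in the P19 example; this may not hold for every collection.
    """
    return f"{collection_id}-{volume_id}{int(paper_id):03d}"


def build_volume(collection_id, volume_id, booktitle, year, papers):
    """Return a <collection> element holding a single <volume>.

    `papers` is a list of dicts with the hypothetical keys "title" and
    "authors" (a list of (first, last) tuples), in proceedings order.
    """
    collection = ET.Element("collection", id=collection_id)
    volume = ET.SubElement(collection, "volume", id=volume_id)

    # Volume-level information goes into the <meta> section.
    meta = ET.SubElement(volume, "meta")
    ET.SubElement(meta, "booktitle").text = booktitle
    ET.SubElement(meta, "url").text = f"{collection_id}-{volume_id}"
    ET.SubElement(meta, "year").text = str(year)

    # The frontmatter uses paper number 0 in the example above.
    frontmatter = ET.SubElement(volume, "frontmatter")
    ET.SubElement(frontmatter, "url").text = anthology_url(
        collection_id, volume_id, "0"
    )

    # Papers are numbered from 1 within the volume.
    for number, record in enumerate(papers, start=1):
        paper = ET.SubElement(volume, "paper", id=str(number))
        ET.SubElement(paper, "title").text = record["title"]
        for first, last in record["authors"]:
            author = ET.SubElement(paper, "author")
            ET.SubElement(author, "first").text = first
            ET.SubElement(author, "last").text = last
        ET.SubElement(paper, "url").text = anthology_url(
            collection_id, volume_id, str(number)
        )
    return collection
```

The resulting element can be serialized with ET.tostring(collection, 
encoding="unicode"). Fields such as editors, pages, abstracts, and DOIs are 
left out here and would be added in the same way when the Archive provides 
them.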

### Metadata formats

The Archive contains additional information beyond proceedings volumes, which 
we would also like converted. This includes the four indexes listed above: 
Languages and language pairs, System and project names, Organizations and 
Affiliations, and Methods, techniques, applications, and uses. Each of these 
sections should be converted into a validating YAML format. For example, “Index 
of languages: A–D: publications since 2010” should be converted to look 
something like the following:

afr-dut:
   - [paperid]
   - [paperid]

Each key lists all the paper IDs that were originally recorded for that 
language pair, and the other metadata indexes follow the same pattern.
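
One possible way to produce such a file is sketched below in Python. It 
assumes the index pages have already been parsed into (key, paper ID) pairs; 
the function names are hypothetical, and the writer is hand-rolled to avoid 
depending on a YAML library, so it only handles plain keys and IDs that need 
no quoting.

```python
from collections import defaultdict


def group_by_key(records):
    """Group paper IDs under each metadata key, e.g. a language pair.

    `records` is an iterable of (key, paper_id) tuples. Duplicates are
    dropped, and keys and IDs are sorted for a stable output.
    """
    grouped = defaultdict(set)
    for key, paper_id in records:
        grouped[key].add(paper_id)
    return {key: sorted(ids) for key, ids in sorted(grouped.items())}


def to_yaml(index):
    """Serialize {key: [paper_id, ...]} as a block-style YAML mapping.

    Assumption: keys and IDs are plain strings without characters that
    would need quoting in YAML.
    """
    lines = []
    for key, paper_ids in index.items():
        lines.append(f"{key}:")
        lines.extend(f"   - {paper_id}" for paper_id in paper_ids)
    return "\n".join(lines) + "\n"
```

For instance, to_yaml(group_by_key(pairs)) applied to pairs extracted from the 
language index would yield one key per language pair, each followed by its 
list of new Anthology identifiers.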