WMT 2016 Shared Task on Bilingual Document Alignment CALL FOR PARTICIPATION
========================================================
WMT 2016 Shared Task on Bilingual Document Alignment
========================================================

Website: http://www.statmt.org/wmt16/bilingual-task.html
At WMT 2016 (co-located with ACL 2016)

Parallel corpora are especially important for statistical machine translation, but so far the collection of such data within the academic research community has been ad hoc and limited in scale. To promote research on this problem, we organize a shared task on aligning bilingual documents from crawled web sites.

More details can be found below, and on our website:
http://www.statmt.org/wmt16/bilingual-task.html

Important Dates:

  Release of training data:     February 12, 2016
  Release of test data:         April 11, 2016
  Results submission deadline:  May 2, 2016
  Paper submission deadline:    May 8, 2016
  Notification of acceptance:   June 5, 2016
  Camera-ready deadline:        June 22, 2016

=========================
Detailed Task Description
=========================

The task is to align French web pages to English web pages for a given crawled webdomain (a set of web pages under a fully qualified domain name, FQDN).

TRAINING DATA:

For the crawled data we provide one file per webdomain in .lett format, adapted from Bitextor. This is a plain text format with one line per page. Each line consists of 6 tab-separated values:

  Language ID (e.g. en)
  Mime type (always text/html)
  Encoding (always charset=utf-8)
  URL
  HTML in Base64 encoding
  Text in Base64 encoding

We make sure that the language ID is reliable, at least for the documents in the train and test pairs. We also ensure that all known pairs have been crawled and no URLs are missing from the crawls.

Text extraction was performed using an HTML5 parser. As the original HTML pages are available, participants are welcome to implement their own text extraction, for example to remove boilerplate. To facilitate use of the .lett files we provide a simple reader class in Python.

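For orientation, the six-field layout described above could be parsed roughly as follows. This is only a minimal sketch and not the reader class distributed with the task: the names Page and read_lett are hypothetical, and it assumes the input is an iterable of already-decoded text lines (for example an open file) and that both Base64 payloads decode to UTF-8.

```python
import base64
from collections import namedtuple

# Hypothetical record type; field order follows the six columns listed above.
Page = namedtuple("Page", "lang mime encoding url html text")


def read_lett(lines):
    """Yield one Page per non-empty line of a .lett file.

    `lines` is any iterable of text lines. Assumption: the HTML and text
    payloads are Base64-encoded UTF-8, as the format description suggests.
    """
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(
                "line %d: expected 6 tab-separated fields, got %d"
                % (number, len(fields))
            )
        lang, mime, encoding, url, html_b64, text_b64 = fields
        # errors="replace" keeps the sketch robust to stray invalid bytes.
        html = base64.b64decode(html_b64).decode("utf-8", errors="replace")
        text = base64.b64decode(text_b64).decode("utf-8", errors="replace")
        yield Page(lang, mime, encoding, url, html, text)
```

A caller might group the resulting pages by their lang field to obtain the English and French candidate sets for one webdomain.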
Additionally, we have identified spans of French text for which we produced English translations using MT. These translations are not part of the .lett files but are provided separately.

As part of the training data we provide a set of 1,624 correctly aligned EN-FR pairs from 49 webdomains. The number of pairs per webdomain varies between 4 and over 200. All pairs are from within a single webdomain; possible matches between two different webdomains, e.g. siemens.de and siemens.com, are not considered in this task. Answer keys are given in the format

  Source_URL<TAB>Target_URL

TEST SET:

For testing, we will provide additional crawls, in the same format, of new webdomains distinct from the ones in the training data. For these, no known pairs will be provided. Because the full set of valid document pairs is unknown, evaluation will be based entirely on precision on an annotated subset of correctly aligned pairs.

Participants are expected to produce a list of possible pairings in the format of the training data. Each source URL may be matched with at most one target URL and vice versa. Should a URL occur repeatedly, later occurrences are ignored. We provide an evaluation script to assess performance during development.

BASELINE:

We provide a simple baseline method based on URL matching. Training data and baseline method are available at
http://www.statmt.org/wmt16/bilingual-task.html

ORGANIZERS:

Christian Buck, University of Edinburgh
Philipp Koehn, Johns Hopkins University

ACKNOWLEDGMENT:

This shared task received support from a Google Faculty Research Award.
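
NOTE ON THE SUBMISSION FORMAT:

The one-to-one rule described under TEST SET (each source URL and each target URL is used at most once, later occurrences are ignored) can be illustrated with a short sketch. This is not the official evaluation script. The function names are hypothetical, and two points are assumptions: a URL counts as used only once a pair containing it has been kept, and the score is the fraction of annotated pairs that appear in the filtered submission.

```python
def read_pairs(lines):
    """Parse Source_URL<TAB>Target_URL lines into (source, target) tuples."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        source, target = line.split("\t")
        yield source, target


def filter_one_to_one(pairs):
    """Keep a pair only if neither of its URLs was used by an earlier kept pair."""
    used_sources, used_targets = set(), set()
    kept = []
    for source, target in pairs:
        if source in used_sources or target in used_targets:
            continue  # later occurrence of a URL: ignored
        used_sources.add(source)
        used_targets.add(target)
        kept.append((source, target))
    return kept


def fraction_found(predicted, annotated):
    """Fraction of annotated pairs present in the filtered prediction list.

    Assumption: this only approximates the scoring described above.
    """
    annotated = set(annotated)
    if not annotated:
        return 0.0
    kept = set(filter_one_to_one(predicted))
    return len(kept & annotated) / len(annotated)
```

Because earlier lines take priority under this rule, a system would presumably list its most confident pairings first.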
