Parallel FDA5 WMT'14 Datasets Dear moses-list,
We make the English, Czech, French, German, and Russian datasets we used when building Parallel FDA5 Moses SMT systems for research purposes, available at: https://github.com/bicici/ParFDA5WMT Results are presented in the citation provided below. Citation: Ergun Biçici, Qun Liu, and Andy Way. Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation Systems. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, USA, June 2014. Association for Computational Linguistics. The datasets and the SMT results can serve as a benchmark for SMT research where further linguistic processing can be performed to see whether the results can be improved. The datasets allow fast deployment of accurate SMT systems and can be used for benchmarking the performance of SMT systems. Language models were built using SRILM ( http://www.speech.sri.com/projects/srilm/). Language model corpora used contain 15M sentences some of which are selected from LDC Gigaword corpora by the Parallel FDA5 algorithm: [4 use the LDC English Gigaword 5th edition] - Czech - English: 2.13 million sentences from LDC English Gigaword, ~1.69 %. - French - English: 2.49 million sentences from LDC English Gigaword, ~1.97 % - German - English: 2.57 million sentences from LDC English Gigaword, ~2.03 % - Russian - English: 3.34 million sentences from LDC English Gigaword, ~2.64 % [1 use the LDC French Gigaword 3rd edition] - English - French: 0.47 million sentences from LDC French Gigaword, ~1.93 % Work using the datasets: ------------------------ - Ergun Biçici, Qun Liu, and Andy Way. Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation Systems. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, USA, June 2014. Association for Computational Linguistics. - Ergun Biçici (contributor), “Quality Estimation for Extending Good Translations”, QTLaunchPad Deliverable: http://www.qt21.eu/launchpad/deliverable/quality-estimation-extending-good-translations - Ergun Biçici, High Quality Machine Translation with ITERPE, 2014. Note: Dublin City University Invention Disclosure. LICENSE: CNGL License for Open Data allowing use for research and academic purposes. Best Regards, Ergun Ergun Biçici, CNGL, School of Computing, DCU, www.cngl.ie, Phone: +353-1-700-6711 http://www.computing.dcu.ie/~ebicici/
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
