[Moses-support] New parallel corpus release: OpenSubtitles2016

Jorg Tiedemann Tue, 15 Mar 2016 11:57:13 -0700

OpenSubtitles2016

We just released a major update of the parallel subtitle corpus in OPUS:
http://opus.lingfil.uu.se/OpenSubtitles2016.php


The release contains 2.8 million subtitle files in 60 languages, with a total
of over 17 billion tokens in 2.6 billion sentences and sentence fragments.
As usual in OPUS, all languages are sentence-aligned, creating a total of 1,689
bitexts.
The data sets are provided in three formats: standalone XML with standoff
sentence alignment, TMX, and aligned plain text (the format often used for
training SMT models).
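
For readers new to the aligned plain text format, here is a minimal Python
sketch of how such a bitext is typically consumed. It assumes the usual
convention of two files in which line N of one file is aligned with line N of
the other; the function name and the sample sentences are made up for
illustration and are not part of the release.

```python
import io
from itertools import zip_longest


def read_bitext(src_lines, tgt_lines):
    """Yield (source, target) sentence pairs from two line-aligned streams.

    Assumption: line N of the source stream is aligned with line N of the
    target stream, as in the plain text files commonly used for SMT training.
    """
    for n, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            # One stream ended early, so the alignment cannot be trusted.
            raise ValueError(f"streams differ in length at line {n}")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # skip pairs where either side is empty
            yield src, tgt


# In-memory stand-ins for two hypothetical files, one per language.
src = io.StringIO("Hello.\nHow are you?\n")
tgt = io.StringIO("Hallo.\nWie geht's?\n")
pairs = list(read_bitext(src, tgt))
```

With real data, the two `io.StringIO` objects would be replaced by files opened
with `open(path, encoding="utf-8")`. The length check is a design choice: a
mismatch in line counts usually means the two files do not belong together.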

More information is available in:
Pierre Lison and Jörg Tiedemann, 2016, OpenSubtitles2016: Extracting Large
Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
International Conference on Language Resources and Evaluation (LREC 2016)
http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf


In addition, we also provide intra-lingual alignments between alternative 
subtitles in the same language:
http://opus.lingfil.uu.se/OpenSubtitles2016alt.php

More information about those alignments and how they are sorted into various 
categories can be found in:
Jörg Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of
Movie Subtitles. In Proceedings of the 10th International Conference on
Language Resources and Evaluation (LREC 2016)
http://stp.lingfil.uu.se/~joerg/paper/lrec2016.pdf

Note that all data sets are created automatically using various pre-processing
and alignment tools, so there will be problems at various levels. Feedback is
very welcome!


Other new data sets in OPUS:

News Commentary version 11 (originally provided by CASMACAT):
http://opus.lingfil.uu.se/News-Commentary11.php
Unlike the original source, this release is truly multilingual, with
alignments across all languages.

Global Voices (also provided by CASMACAT):
http://opus.lingfil.uu.se/GlobalVoices.php
Again, this version is multilingual.

Wikipedia:
A corpus of parallel sentences extracted from Wikipedia by Krzysztof Wołk and
Krzysztof Marasek. More information: Krzysztof Wołk and Krzysztof Marasek,
2014, Building Subject-aligned Comparable Corpora and Mining it for Truly
Parallel Sentence Pairs. Procedia Technology, 18, Elsevier, pp. 126-132
http://www.sciencedirect.com/science/article/pii/S2212017314005453


For more information on OPUS:
http://opus.lingfil.uu.se/index.php
Select the language pair you are interested in to see all resources that are 
available for that particular language pair.
Data formats are explained here: http://opus.lingfil.uu.se/trac

Enjoy!


**********************************************************************************
Jörg Tiedemann
Department of Modern Languages             http://www.helsinki.fi/~tiedeman/
University of Helsinki

