The DGT translation memories are truly multilingually aligned:
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

Apart from that, there are several multilingual corpora in OPUS, although 
they are all only bilingually aligned. In most cases it is quite 
straightforward to combine the sentence alignments to extract a multilingual 
corpus. I would recommend using the standoff annotation of the sentence 
alignments in OPUS (the ces files). I have a simple script that does that:
http://opus.lingfil.uu.se/tools/public/opus-tools/scripts/opus2multi
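
To illustrate the idea of combining bilingual alignments (this is not the 
opus2multi script, only a minimal sketch): pick one pivot language, and keep 
only those pivot segments that are linked in every language pair. The sketch 
assumes each link has already been read from the ces files as a string of 
the form "pivot ids;other ids" (for example "2 3;2"); that input format and 
the function names are assumptions made for the example.

```python
def parse_link(value):
    """Split an assumed 'src ids;tgt ids' string into two id tuples."""
    src, _, tgt = value.partition(";")
    return tuple(src.split()), tuple(tgt.split())


def join_on_pivot(links_by_lang):
    """Join pivot->language links into multilingual rows.

    links_by_lang maps a language code to the list of link strings that
    align the pivot language with that language.
    """
    tables = {}
    for lang, links in links_by_lang.items():
        table = {}
        for value in links:
            pivot_ids, other_ids = parse_link(value)
            # Skip empty links, i.e. segments without a counterpart.
            if pivot_ids and other_ids:
                table[pivot_ids] = other_ids
        tables[lang] = table

    if not tables:
        return []

    # A pivot unit is usable only if every language pair segments the
    # pivot side in exactly the same way.
    shared = set.intersection(*(set(t) for t in tables.values()))

    # Keep the order in which pivot units appear in the first table.
    first = next(iter(tables.values()))
    return [
        (pivot, {lang: tables[lang][pivot] for lang in tables})
        for pivot in first
        if pivot in shared
    ]
```

Requiring identical pivot units is the simplest choice and throws data away 
whenever two pairs group the pivot sentences differently; merging 
overlapping units would recover more, at the cost of more code.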

After that you could use this script to convert to Moses format if you like:
http://opus.lingfil.uu.se/tools/public/opus-tools/scripts/opus2moses
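
For reference, the target layout is plain text with one file per language, 
where line i of every file belongs to the same aligned unit. A minimal 
sketch of producing that layout from rows of already extracted sentence 
text follows; the row structure (one dict per aligned unit) and the function 
name are assumptions for the example, and this is not what opus2moses does 
internally.

```python
def to_line_aligned(rows, langs):
    """Return {lang: text}, with one line per aligned unit in each text.

    rows is a list of dicts mapping a language code to the sentence text
    of one aligned unit.
    """
    lines = {lang: [] for lang in langs}
    for row in rows:
        for lang in langs:
            # Collapse all whitespace so that an embedded newline cannot
            # shift the line numbering and break the alignment.
            lines[lang].append(" ".join(row[lang].split()))
    return {lang: "".join(line + "\n" for line in lines[lang]) for lang in langs}
```

Each returned string can then be written to its own file, e.g. corpus.en, 
corpus.fr and so on.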

The biggest multilingual corpus would be OpenSubtitles:
http://opus.lingfil.uu.se/OpenSubtitles2016.php

The problem here is that it is not always the same subtitle version that is 
aligned for each language pair. I have just recently compiled intra-lingual 
alignments between different subtitle variants. Those links could be used to 
map transitively to other languages, but for this you would have to implement 
your own tool. The intra-lingual links are here (in the column “all”):
http://opus.lingfil.uu.se/OpenSubtitles2016alt.php
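
The transitive mapping could look roughly like the following sketch. It 
assumes the links have already been reduced to one-to-one dictionaries of 
sentence ids (real links can be many-to-many), and all names and ids are 
made up for the example: French is aligned to English variant A, the 
intra-lingual links map variant A to variant B, and variant B is aligned to 
German.

```python
def compose(first, *rest):
    """Chain id mappings; drop every id that is lost along the way."""
    out = {}
    for key, value in first.items():
        for mapping in rest:
            value = mapping.get(value)
            if value is None:
                # No link in this step, so the chain is broken.
                break
        else:
            # Reached only if every step had a link.
            out[key] = value
    return out


# Hypothetical ids, only to show the call.
fr_to_en_a = {"f1": "a1", "f2": "a2", "f3": "a3"}
en_a_to_en_b = {"a1": "b1", "a2": "b5"}
en_b_to_de = {"b1": "d1", "b2": "d2"}
fr_to_de = compose(fr_to_en_a, en_a_to_en_b, en_b_to_de)
```

Every extra hop loses the segments that are unlinked in that step, so the 
result shrinks quickly when the variants differ a lot.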

Coverage varies considerably between languages. You should select languages 
for which you would expect good coverage of the same movies and their 
subtitles. It’s a bit complicated but should be doable.
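
One simple way to choose the languages is to count, for each candidate set, 
how many movies have subtitles in all of them. The sketch below assumes you 
have already collected a set of movie ids per language (how to obtain these 
from the corpus is not shown); the function name is hypothetical.

```python
from itertools import combinations


def shared_coverage(movies_by_lang, size):
    """Rank language sets of the given size by number of shared movies."""
    if size < 1:
        raise ValueError("size must be at least 1")
    scores = []
    for combo in combinations(sorted(movies_by_lang), size):
        # Movies that have subtitles in every language of this set.
        shared = set.intersection(*(movies_by_lang[lang] for lang in combo))
        scores.append((combo, len(shared)))
    # Largest overlap first; ties keep alphabetical order.
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

The number of combinations grows quickly with the number of languages, so 
this is only practical for a short list of candidates.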


All the best and good luck,
Jörg

**********************************************************************************
Jörg Tiedemann
Department of Modern Languages             http://www.helsinki.fi/~tiedeman/
University of Helsinki

> On 22 Jan 2016, at 17:55, Lane Schwartz <dowob...@gmail.com> wrote:
> 
> Marcin,
> 
> That sounds great! Yes, please do make an announcement. I would definitely 
> make use of such a multi-aligned corpus.
> 
> Lane
> 
> 
> On Fri, Jan 22, 2016 at 9:36 AM, Marcin Junczys-Dowmunt <junc...@amu.edu.pl> 
> wrote:
> Hi Graham,
> At the UN we are now working to release an official version of our data. As a 
> bonus to the pair-wise alignment, it will contain a 6-way fully aligned 
> subcorpus for English, French, Spanish, Russian, Chinese, Arabic; about 13M 
> segments per language. We are waiting for some LREC feedback and the official 
> greenlight from UN officials, but that should be a matter of a couple of 
> weeks now (maybe one, maybe two, maybe four). Once it is ready I can make an 
> announcement here.
> Best,
> Marcin 
> 
> On 22.01.2016 at 16:26, Graham Neubig wrote:
>> Dear Moses Mailing List,
>> 
>> This is not directly related to Moses, but I was wondering if there are any 
>> high-quality, multi-lingually sentence aligned corpora available (i.e. 3 or 
>> more languages with aligned sentences). We're aware of the Europarl and 
>> Bible corpora, but Europarl only covers European languages, and the Bible 
>> corpus is quite small in MT terms.
>> 
>> TED and MULTI-UN are options, but as far as I know the data is only 
>> bilingually aligned at the moment, and it can be a bit hard to get a clean 
>> multi-lingual corpus from them. If anyone has any experience with this, or 
>> resource available, I'd love some info.
>> 
>> Thanks in advance,
>> Graham
>> 
>> 
>> _______________________________________________
>> Moses-support mailing list
>> Moses-support@mit.edu
>> http://mailman.mit.edu/mailman/listinfo/moses-support
> 
> 
> -- 
> When a place gets crowded enough to require ID's, social collapse is not
> far away.  It is time to go elsewhere.  The best thing about space travel
> is that it made it possible to go elsewhere.
>                 -- R.A. Heinlein, "Time Enough For Love"


Jörg Tiedemann
tiede...@gmail.com