Re: [Moses-support] Multilingually Sentence-Aligned Corpora

Jörg Tiedemann Fri, 22 Jan 2016 10:54:06 -0800

The DGT translation memories are truly multilingually aligned:
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory


Beyond that, OPUS has several multilingual corpora, although they are all 
aligned bilingually. In most cases it is quite straightforward to combine the 
sentence alignments to extract a multilingual corpus. I would recommend using 
the standoff annotation of the sentence alignments in OPUS (ces files). I have 
a simple script that does that:
http://opus.lingfil.uu.se/tools/public/opus-tools/scripts/opus2multi
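
As a rough illustration of combining bilingual alignments (this is not the 
opus2multi script, only a minimal Python sketch): assume every bitext shares 
one pivot language, and that each link is given as a string of the form 
"source ids;target ids" with space-separated sentence ids, as in the xtargets 
attribute of the alignment files. The function names and sample ids are made 
up for the example. Only 1:1 links are kept, and a pivot sentence survives 
only if it is linked in every bitext.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def one_to_one(xtargets: str) -> Optional[Tuple[str, str]]:
    """Return (source_id, target_id) for a 1:1 link, else None.

    Assumes the form "src ids;trg ids" with space-separated ids.
    """
    src, _, trg = xtargets.partition(";")
    src_ids, trg_ids = src.split(), trg.split()
    if len(src_ids) == 1 and len(trg_ids) == 1:
        return src_ids[0], trg_ids[0]
    return None


def pivot_map(links: Iterable[str]) -> Dict[str, str]:
    """Map pivot sentence id -> other-language sentence id (1:1 links only)."""
    pairs = (one_to_one(x) for x in links)
    return {p[0]: p[1] for p in pairs if p is not None}


def combine(bitexts: Dict[str, Iterable[str]]) -> List[Dict[str, str]]:
    """Keep pivot sentences that have a 1:1 link in every bitext.

    bitexts maps a language code to the links of the pivot-to-language
    bitext. Each result row maps "pivot" and every language code to a
    sentence id.
    """
    maps = {lang: pivot_map(links) for lang, links in bitexts.items()}
    if not maps:
        return []
    shared = set.intersection(*(set(m) for m in maps.values()))
    rows = []
    for pid in sorted(shared):
        row = {"pivot": pid}
        row.update({lang: m[pid] for lang, m in maps.items()})
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical links: pivot sentence 2 has a 1:2 link to French,
    # so it is dropped from the combined result.
    example = {
        "fr": ["1;1", "2;2 3", "3;4"],
        "de": ["1;1", "2;2", "3;3"],
    }
    for row in combine(example):
        print(row)
```

Dropping everything that is not 1:1 is the simplest choice and loses data; 
merging 1:n links across languages is possible but needs more bookkeeping.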

After that you could use this script to convert to Moses format if you like:
http://opus.lingfil.uu.se/tools/public/opus-tools/scripts/opus2moses
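
For reference, the Moses format is simply one plain-text file per language 
with one sentence per line, where line n in every file belongs to the same 
aligned unit. A minimal Python sketch of that layout, assuming the aligned 
sentences are already available as rows of text (the function name and sample 
sentences are hypothetical, and this is not a description of what opus2moses 
does internally):

```python
from typing import Dict, List


def to_moses(rows: List[Dict[str, str]], langs: List[str]) -> Dict[str, str]:
    """Return one newline-terminated text per language, line-aligned.

    Rows that lack any requested language, or have an empty sentence,
    are skipped so that the outputs stay parallel.
    """
    lines: Dict[str, List[str]] = {lang: [] for lang in langs}
    for row in rows:
        # Collapse internal whitespace so each sentence stays on one line.
        cleaned = {lang: " ".join(row.get(lang, "").split()) for lang in langs}
        if all(cleaned.values()):
            for lang in langs:
                lines[lang].append(cleaned[lang])
    return {lang: "".join(s + "\n" for s in lines[lang]) for lang in langs}


if __name__ == "__main__":
    rows = [
        {"en": "Hello .", "fr": "Bonjour ."},
        {"en": "Only English"},  # skipped: no French side
        {"en": "Two\nlines", "fr": "Deux lignes"},
    ]
    for lang, text in to_moses(rows, ["en", "fr"]).items():
        print(lang, repr(text))
```

Each returned string can be written to a file such as corpus.en or corpus.fr 
(names chosen here only as an example).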

The biggest multilingual corpus would be OpenSubtitles:
http://opus.lingfil.uu.se/OpenSubtitles2016.php

The problem here is that it’s not always the same subtitle version that is 
aligned for each language pair. I have recently compiled intra-lingual 
alignments between different subtitle variants. Those links could be used to 
map transitively to other languages, but for this you would have to implement 
your own tool. The intra-lingual links are here:
http://opus.lingfil.uu.se/OpenSubtitles2016alt.php
(in the column “all”)
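
The transitive mapping could look roughly like the Python sketch below. It 
assumes all links have already been reduced to 1:1 mappings between sentence 
ids. The setup (an English subtitle version A aligned to French, a version B 
aligned to German, and intra-lingual links from A to B) and all names are 
purely illustrative; no existing OPUS tool is implied.

```python
from typing import Dict, List, Tuple


def compose(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, str]:
    """Chain two 1:1 mappings: a -> b and b -> c give a -> c."""
    return {a: second[b] for a, b in first.items() if b in second}


def trilingual(
    en_a_to_fr: Dict[str, str],
    en_a_to_en_b: Dict[str, str],
    en_b_to_de: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (fr_id, en_a_id, de_id) triples reachable via the A -> B links."""
    en_a_to_de = compose(en_a_to_en_b, en_b_to_de)
    return [
        (en_a_to_fr[e], e, en_a_to_de[e])
        for e in sorted(en_a_to_fr)
        if e in en_a_to_de
    ]


if __name__ == "__main__":
    # Hypothetical sentence ids for one movie.
    en_a_to_fr = {"a1": "f1", "a2": "f2", "a3": "f3"}
    en_a_to_en_b = {"a1": "b1", "a2": "b5", "a3": "b3"}
    en_b_to_de = {"b1": "d1", "b3": "d2"}
    for triple in trilingual(en_a_to_fr, en_a_to_en_b, en_b_to_de):
        print(triple)
```

Every extra hop can only lose sentences, which is one reason the choice of 
languages and movies matters for the final size.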

Coverage varies a lot between languages. You should select languages for 
which you can expect good coverage of the same movies and their subtitles. 
It’s a bit complicated but should be doable.


All the best and good luck,
Jörg

**********************************************************************************
Jörg Tiedemann
Department of Modern Languages             http://www.helsinki.fi/~tiedeman/
University of Helsinki

On 22 Jan 2016, at 17:55, Lane Schwartz <[email protected]> wrote:

Marcin,

That sounds great! Yes, please do make an announcement. I would definitely make 
use of such a multi-aligned corpus.

Lane


On Fri, Jan 22, 2016 at 9:36 AM, Marcin Junczys-Dowmunt <[email protected]> wrote:
Hi Graham,
At the UN we are now working to release an official version of our data. As a 
bonus to the pair-wise alignment, it will contain a 6-way fully aligned 
subcorpus for English, French, Spanish, Russian, Chinese, Arabic; about 13M 
segments per language. We are waiting for some LREC feedback and the official 
greenlight from UN officials, but that should be a matter of a couple of weeks 
now (maybe one, maybe two, maybe four). Once it is ready I can make an 
announcement here.
Best,
Marcin

On 22.01.2016 at 16:26, Graham Neubig wrote:
Dear Moses Mailing List,

This is not directly related to Moses, but I was wondering if there are any 
high-quality, multi-lingually sentence aligned corpora available (i.e. 3 or 
more languages with aligned sentences). We're aware of the Europarl and Bible 
corpora, but Europarl only covers European languages, and the Bible corpus is 
quite small in MT terms.

TED and MULTI-UN are options, but as far as I know the data is only bilingually 
aligned at the moment, and it can be a bit hard to get a clean multi-lingual 
corpus from them. If anyone has any experience with this, or resources 
available, I'd love some info.

Thanks in advance,
Graham



_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support







--
When a place gets crowded enough to require ID's, social collapse is not
far away.  It is time to go elsewhere.  The best thing about space travel
is that it made it possible to go elsewhere.
                -- R.A. Heinlein, "Time Enough For Love"
