The DGT translation memories are truly multilingually aligned: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
Otherwise, there are several multilingual corpora in OPUS, even though they are all only bilingually aligned. In most cases it is quite straightforward to combine the sentence alignments to extract multilingual corpora. I would recommend that you use the standoff annotation of the sentence alignments in OPUS (ces files). I have a simple script that does that:
http://opus.lingfil.uu.se/tools/public/opus-tools/scripts/opus2multi

After that you could use this script to convert to Moses format if you like:
http://opus.lingfil.uu.se/tools/public/opus-tools/scripts/opus2moses

The biggest multilingual corpus would be OpenSubtitles:
http://opus.lingfil.uu.se/OpenSubtitles2016.php

But here the problem is that it’s not always the same subtitle version that is aligned for each language pair. I have just recently compiled the intra-lingual alignments for different subtitle variants. Those links could be used to transitively map to other languages again, but for this you have to implement your own tool. Intra-lingual links are here:
http://opus.lingfil.uu.se/OpenSubtitles2016alt.php (in the column “all”)

The coverage varies a lot between languages. You should select languages for which you would expect good coverage of the same movies and their subtitles. It’s a bit complicated but should be doable.

All the best and good luck,
Jörg

**********************************************************************************
Jörg Tiedemann
Department of Modern Languages    http://www.helsinki.fi/~tiedeman/
University of Helsinki

On 22 Jan 2016, at 17:55, Lane Schwartz wrote:

Marcin,

That sounds great! Yes, please do make an announcement. I would definitely make use of such a multi-aligned corpus.

Lane

On Fri, Jan 22, 2016 at 9:36 AM, Marcin Junczys-Dowmunt wrote:

Hi Graham,

At the UN we are now working to release an official version of our data.
As a bonus to the pair-wise alignment, it will contain a 6-way fully aligned subcorpus for English, French, Spanish, Russian, Chinese, Arabic; about 13M segments per language. We are waiting for some LREC feedback and the official green light from UN officials, but that should be a matter of a couple of weeks now (maybe one, maybe two, maybe four). Once it is ready I can make an announcement here.

Best,
Marcin

On 22.01.2016 at 16:26, Graham Neubig wrote:

Dear Moses Mailing List,

This is not directly related to Moses, but I was wondering if there are any high-quality, multi-lingually sentence aligned corpora available (i.e. 3 or more languages with aligned sentences). We're aware of the Europarl and Bible corpora, but Europarl only covers European languages, and the Bible corpus is quite small in MT terms. TED and MULTI-UN are options, but as far as I know the data is only bilingually aligned at the moment, and it can be a bit hard to get a clean multi-lingual corpus from them.

If anyone has any experience with this, or a resource available, I'd love some info.

Thanks in advance,
Graham

_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support

--
When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
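A rough illustration of the idea discussed in this thread, combining bilingual sentence alignments into a multilingual one by pivoting on a shared language. This is only a sketch and not what opus2multi does internally: the function name `combine_via_pivot`, the input layout, and the sentence IDs are all hypothetical, and it assumes the alignments have already been read into (pivot ID, target ID) pairs and keeps only unambiguous 1:1 links.

```python
from typing import Dict, List, Tuple


def combine_via_pivot(
    links: Dict[str, List[Tuple[str, str]]]
) -> List[Dict[str, str]]:
    """Intersect 1:1 bilingual links that share one pivot language.

    `links` maps a target language to (pivot_sentence_id, target_sentence_id)
    pairs. Returns one row per pivot sentence that is linked in every language.
    This layout is an assumption made for the sketch, not an OPUS format.
    """
    indexed: Dict[str, Dict[str, str]] = {}
    for lang, pairs in links.items():
        table: Dict[str, str] = {}
        ambiguous = set()
        for pivot_id, target_id in pairs:
            if pivot_id in table or pivot_id in ambiguous:
                # The pivot sentence has more than one link: drop it entirely.
                ambiguous.add(pivot_id)
                table.pop(pivot_id, None)
            else:
                table[pivot_id] = target_id
        indexed[lang] = table

    if not indexed:
        return []

    # Keep only pivot sentences that have a link in every target language.
    shared = set.intersection(*(set(t) for t in indexed.values()))
    rows = []
    for pivot_id in sorted(shared):
        row = {"pivot": pivot_id}
        for lang, table in indexed.items():
            row[lang] = table[pivot_id]
        rows.append(row)
    return rows


# Made-up sentence IDs: English as pivot, French and German as targets.
example_links = {
    "fr": [("s1", "f1"), ("s2", "f2"), ("s3", "f3")],
    "de": [("s1", "d1"), ("s3", "d4"), ("s3", "d5")],
}
example_rows = combine_via_pivot(example_links)
print(example_rows)
```

Dropping ambiguous and missing links keeps the result clean at the cost of coverage, which is the same trade-off mentioned above for languages with uneven coverage.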
