Re: [Moses-support] 10 years of OPUS

Read, James C Wed, 06 Nov 2013 09:41:59 -0800

I should think it would be worth leaving them there. Perhaps a link on the 
Moses website baseline system page could be useful. That way somebody who wants 
to get started quickly without all the training time it takes to get up and 
running can do so.


I've got two servers with 32 8-core CPUs and 125GB RAM each, running the 
Moses training script on the Europarl data. If we throw in getting over 
installation and dependency hurdles, it's about 2 weeks' work with that kind of 
setup just to get installed, learn the basics and train some models.

I'm sure that having pretrained models available could lower the barrier to 
entry considerably.

James

________________________________
From: [email protected] [[email protected]] on behalf 
of Jörg Tiedemann [[email protected]]
Sent: 06 November 2013 17:09
To: Tom Hoar
Cc: <[email protected]>
Subject: Re: [Moses-support] 10 years of OPUS


Extracting lexicalized reordering tables shouldn't be a problem: data files and 
word alignments are given. Note also that phrase tables exist only for one 
direction. For the other direction you will need to swap phrases, scores and 
word alignments, but this is easy as well.
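
To make the swap concrete, here is a minimal sketch in Python. It assumes the 
usual Moses text layout "source ||| target ||| scores ||| alignment ||| counts", 
that the first four scores are ordered p(f|e) lex(f|e) p(e|f) lex(e|f), and 
that the first two counts belong to the two sides. None of this has been 
checked against the OPUS files, and the function name is made up for 
illustration.

```python
def invert_phrase_table_line(line):
    """Swap the translation direction of one phrase-table entry.

    Assumed layout: source ||| target ||| scores ||| alignment ||| counts
    """
    fields = [f.strip() for f in line.rstrip("\n").split("|||")]
    if len(fields) < 3:
        raise ValueError("expected at least source, target and scores")
    src, tgt = fields[0], fields[1]
    scores = fields[2].split()
    if len(scores) < 4:
        raise ValueError("expected at least four translation scores")
    # Assumed order: p(f|e) lex(f|e) p(e|f) lex(e|f); swap the two pairs.
    # Any extra scores after the first four are left where they are.
    scores[0:4] = scores[2:4] + scores[0:2]
    out = [tgt, src, " ".join(scores)]
    if len(fields) > 3:
        # Alignment points "i-j" become "j-i", sorted by the new source index.
        points = [tuple(int(x) for x in p.split("-")) for p in fields[3].split()]
        swapped = sorted((j, i) for i, j in points)
        out.append(" ".join(f"{a}-{b}" for a, b in swapped))
    if len(fields) > 4:
        # Assumption: the first two counts belong to the two sides; swap them.
        counts = fields[4].split()
        if len(counts) >= 2:
            counts[0], counts[1] = counts[1], counts[0]
        out.append(" ".join(counts))
    out.extend(fields[5:])  # pass any further fields through unchanged
    return " ||| ".join(out)


if __name__ == "__main__":
    print(invert_phrase_table_line("a b ||| x y z ||| 0.1 0.2 0.3 0.4 ||| 0-2 1-0"))
```

After inverting every line, the table would probably also need to be re-sorted 
by the new source phrase before other tools can use it.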

Tokenization is a bit more tricky. In some cases I use TreeTagger for 
pre-processing to get better tagging results. Otherwise, it's mainly the Moses 
tokenizer with some additional exceptions and smaller changes. Good point; 
I should document this more carefully.

The question remains: should I leave the alignment files online, or are they 
not useful for anyone anyway?

Jörg




On 6 nov 2013, at 16:40, Tom Hoar <[email protected]> wrote:

You make some good points, Jörg.

I think the word alignments and phrase tables would be more useful if you 
referenced which tokenizers and other pre/post-processing tools/chains were 
used for each pair, and their versions. People who wish to augment the 
alignments and/or tables with their own corpora for incremental training, or 
to build new language models, will need to re-create the preparation 
toolchains. Tools change from time to time, even the Moses tokenizer.perl 
script.

It looks like the reordering tables are missing from your collection. How 
useful are the phrase tables without them? For basic phrase-based mode, don't 
users still need to run the step to create the reordering tables?



On 11/06/2013 08:05 PM, joerg wrote:

This is a good question. I was also wondering about this and, as I wrote, I 
started providing word alignments and even phrase tables for the data collected 
in OPUS. The idea is, of course, that people could skip running GIZA++ with 
standard settings over and over again on Europarl data and other standard sets. 
For example, you can now download word alignments for all language pairs for 
Europarl v7 from OPUS:
http://opus.lingfil.uu.se/Europarl/wordalign/

This, of course, assumes that you're happy with my tokenization and other kinds 
of pre-processing that may have influenced the data. In some cases you may also 
want to do other things like compound splitting and preordering which would 
require a different alignment model.
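
As a rough illustration of working with such alignments, the sketch below 
reads one alignment line and pairs up the aligned tokens. It assumes the 
common "i-j" notation (0-based source-target token indices separated by 
spaces) and whitespace-tokenized sentences; whether the OPUS files use exactly 
this layout is an assumption, and the function names are hypothetical.

```python
def parse_alignment(line):
    """Turn a line such as '0-0 1-2 2-1' into [(0, 0), (1, 2), (2, 1)]."""
    points = []
    for token in line.split():
        src_index, tgt_index = token.split("-")
        points.append((int(src_index), int(tgt_index)))
    return points


def aligned_word_pairs(src_sentence, tgt_sentence, alignment_line):
    """Return the (source word, target word) pairs named by the alignment."""
    src = src_sentence.split()
    tgt = tgt_sentence.split()
    return [(src[i], tgt[j]) for i, j in parse_alignment(alignment_line)]


if __name__ == "__main__":
    for pair in aligned_word_pairs("das ist gut", "that is good", "0-0 1-1 2-2"):
        print(pair)
```

Because the indices refer to token positions, they only make sense together 
with text tokenized in exactly the same way as the aligned data.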

I also have phrase tables in those folders, extracted with standard 
settings from the grow-diag-final-and alignments. This is maybe less 
interesting, as it assumes that you work with phrase-based SMT and that you 
work with the truecased data I provide, etc. The truecaser models are, 
of course, also available (for example for Europarl in 
http://opus.lingfil.uu.se/Europarl/wordalign/truecaser/). Monolingual data 
files with the same tokenization are also available (Europarl: 
http://opus.lingfil.uu.se/Europarl/mono/).


As all of this takes a lot of space, I would actually like to know whether
- I should keep those files on-line
- I could remove some of them (for example, phrase tables that take most of the 
disk space)
- anyone is actually interested in using those models and alignments (and 
which ones)


Thank you for your feedback!

Jörg


**********************************************************************************
Jörg Tiedemann                                 
http://stp.lingfil.uu.se/~joerg/



On Nov 6, 2013, at 8:48 AM, Read, James C wrote:

I wonder if there would be a demand for ready-made phrase tables generated from 
the data.
________________________________________
From: [email protected] [[email protected]] on behalf 
of Jorg Tiedemann [[email protected]]
Sent: 02 November 2013 18:54
To: [email protected]
Subject: [Moses-support] 10 years of OPUS

After attending the 20-years-of-bitext workshop at EMNLP I suddenly realized 
that OPUS (http://opus.lingfil.uu.se) also has its 10-year anniversary this 
year (send me some champagne if you like). I will celebrate this anniversary by 
sending out this e-mail with some recent news and highlights.

OPUS is a growing collection of parallel corpora for many languages and various 
domains. The collection has become pretty big and includes a variety of data 
sets and tools that are useful not only for statistical machine translation. 
OPUS has been extended a lot since its first appearance in 2003. Actually, the 
best birthday present would be if someone decided to start a mirror of OPUS. 
Let me know if you are interested.


Here are some of the highlights:

- over 150 languages and language variants
- over 5 billion aligned translation units
- downloads in XML/XCES, plain text (Moses/SMT) and TMX
- raw, tokenized and machine-annotated data
- monolingual data sets (for language modeling)
- search interfaces


Some recent news and data sets:

- EUbookshop: a large but noisy corpus (converted from PDF)
- Tatoeba: a small but clean corpus with many languages
- OpenSubtitles2012: an improved version of the 2011 release
- coming soon: OpenSubtitles2013 - an extension of OpenSubtitles2012
- UN, MultiUN, Europarl v7: aligned for all language combinations
- word alignments and phrase tables for the majority of bitexts


The Web Site: http://opus.lingfil.uu.se
More information: http://opus.lingfil.uu.se/trac/wiki

Feedback is very welcome!
And, be nice to our server!


Jörg Tiedemann
[email protected]





_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
