Re: [Moses-support] 10 years of OPUS

Tom Hoar Wed, 06 Nov 2013 07:48:01 -0800

You make some good points, Jörg.

I think the word alignments and phrase tables would be more useful if you referenced which tokenizers and other pre/post-processing tools/chains were used for each pair, along with their versions. People who wish to augment the alignments and/or tables with their own corpora for incremental training or new language models will need to re-create the preparation toolchains. Tools change from time to time, even the Moses tokenizer.perl script.
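As a rough illustration of what such a reference could look like, here is a minimal sketch of a per-pair manifest that lists each preprocessing tool with its version and a checksum of the script. The field names, the "de-en" pair, the "unknown" version strings and the placeholder script contents are all assumptions made for the example, not anything OPUS actually publishes.

```python
import hashlib
import json


def describe_tool(name, version, script_bytes):
    """Return a manifest entry for one preprocessing tool."""
    return {
        "name": name,
        "version": version,
        # The checksum identifies the exact script even if the version label is vague.
        "sha256": hashlib.sha256(script_bytes).hexdigest(),
    }


def build_manifest(lang_pair, tools):
    """Bundle tool entries, in the order they were applied, for one language pair."""
    return {"lang_pair": lang_pair, "pipeline": list(tools)}


# Hypothetical values: a real manifest would hash the actual script files.
manifest = build_manifest(
    "de-en",
    [
        describe_tool("tokenizer.perl", "unknown", b"#!/usr/bin/perl\n"),
        describe_tool("truecase.perl", "unknown", b"#!/usr/bin/perl\n"),
    ],
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Shipping a small file like this next to each alignment or phrase table would let others check whether their own toolchain matches before adding data.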

It looks like the reordering tables are missing from your collection. How useful are the phrase tables without them? For basic phrase-based mode, don't users still need to run the step to create the reordering tables?




On 11/06/2013 08:05 PM, joerg wrote:

This is a good question. I was also wondering about this and, as I wrote, I started providing word alignments and even phrase tables for the data collected in OPUS. The idea is, of course, that people could skip running GIZA++ with standard settings over and over again on Europarl data and other standard sets. For example, you can now download word alignments for all language pairs for Europarl v7 from OPUS:
http://opus.lingfil.uu.se/Europarl/wordalign/
This, of course, assumes that you're happy with my tokenization and other kinds of pre-processing that may have influenced the data. In some cases you may also want to do other things like compound splitting and preordering, which would require a different alignment model.
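To show why the tokenization matters when reusing alignments, here is a minimal sketch that reads one alignment line and maps it back onto tokens. It assumes the common "i-j" format (zero-based source-target index pairs separated by spaces) and whitespace-tokenized sentences. The actual layout of the OPUS files is not described here, and the example sentence pair is made up.

```python
def parse_alignment(line):
    """Parse 'i-j' pairs into a list of (source_index, target_index) tuples."""
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def aligned_words(source, target, alignment_line):
    """Map alignment points back to the tokens they connect."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    return [(src_tokens[i], tgt_tokens[j]) for i, j in parse_alignment(alignment_line)]


# Hypothetical sentence pair with a one-to-one alignment.
example = aligned_words("das ist gut", "that is good", "0-0 1-1 2-2")
```

Because the indices point at token positions, the alignment is only valid for text tokenized in exactly the same way. A different tokenizer shifts the positions and the points no longer line up.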
I also have phrase tables in those folders, extracted with standard settings from the grow-diag-final-and alignments. This is maybe less interesting, as it assumes that you work with phrase-based SMT and that you have to work with the truecased data I provide, etc. The truecaser models are, of course, also available (for example, for Europarl in http://opus.lingfil.uu.se/Europarl/wordalign/truecaser/). Monolingual data files with the same tokenization are also available (Europarl: http://opus.lingfil.uu.se/Europarl/mono/).
As all of this takes a lot of space, I would actually like to know
- if I should keep those files on-line
- if I could remove some of them (for example, phrase tables that take most of the disk space)
- if anyone actually is interested in using those models and alignments (and which ones)
Thank you for your feedback!

Jörg


**********************************************************************************
Jörg Tiedemann http://stp.lingfil.uu.se/~joerg/

On Nov 6, 2013, at 8:48 AM, Read, James C wrote:
I wonder if there would be a demand for ready-made phrase tables generated from the data.
________________________________________
From: [email protected] [[email protected]] on behalf of Jorg Tiedemann [[email protected]]
Sent: 02 November 2013 18:54
To: [email protected]
Subject: [Moses-support] 10 years of OPUS
After attending the 20-years-of-bitext workshop at EMNLP I suddenly realized that OPUS (http://opus.lingfil.uu.se) also has its 10-year anniversary this year (send me some champagne if you like). I will celebrate this anniversary by sending out this e-mail with some recent news and highlights.
OPUS is a growing collection of parallel corpora for many languages and various domains. The collection has become pretty big and includes a variety of data sets and tools that are not only useful for statistical machine translation. OPUS has been extended a lot since its first appearance in 2003. Actually, the best birthday present would be if anyone decided to start a mirror of OPUS. Let me know if you are interested.
Here are some of the highlights:

- over 150 languages and language variants
- over 5 billion aligned translation units
- downloads in XML/XCES, plain text (Moses/SMT) and TMX
- raw, tokenized and machine-annotated data
- monolingual data sets (for language modeling)
- search interfaces


Some recent news and data sets:

- EUbookshop: a large but noisy corpus (converted from PDF)
- Tatoeba: a small but clean corpus with many languages
- OpenSubtitles2012: an improved version of the 2011 version
- coming soon: OpenSubtitles2013 - an extension of OpenSubtitles2012
- UN, MultiUN, Europarl v7: aligned for all language combinations
- word alignments and phrase tables for the majority of bitexts


The Web Site: http://opus.lingfil.uu.se
More information: http://opus.lingfil.uu.se/trac/wiki

Feedback is very welcome!
And, be nice to our server!


Jörg Tiedemann
[email protected]





_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

