Dear Josh,

Many thanks for answering all these questions. I will try your suggestions.

Regards,
Llio
On Fri, Jul 25, 2008 at 12:08 PM, Josh Schroeder <[EMAIL PROTECTED]> wrote:
> Hi Llio,
>
> You've got a lot of questions spread around in this message. I'll try to
> get to most of them.
>
>>> Dear Moses Group,
>>>
>>> I am having difficulties running the Moses software (not the recently
>>> released version), following the guidelines at
>>> http://www.statmt.org/wmt07/baseline.html and I attach a record of the
>>> final part of the terminal session for your information.
>>>
>>> I started with parallel input files, with each line containing one
>>> sentence, both already tokenised, tab delimited, and in ASCII (is
>>> UTF-8 better?)
>
> Moses itself is encoding-agnostic - use whatever encoding you want. Some of
> the support scripts on statmt.org (tokenizer.perl, for example) are geared
> to work better with UTF-8. I find UTF-8 a lot easier to use -- especially
> when you start dealing with multiple language pairs with different native
> encodings.
>
>>> I followed the instructions under the Prepare Data heading. I briefly
>>> inspected the .tok output files, and preferred the original tokenised
>>> version, e.g. reference numbers with / were not split up. So, I
>>> renamed the original input files as .tok files, filtered out long
>>> sentences and lowercased the training data.
>
> I think you're saying you didn't like the behavior of our sample tokenizer
> with regard to some feature in the training data. If your original files
> are already tokenized in some way, you can just use that data instead of
> re-applying tokenization. Some form of tokenization is definitely
> important, though: you don't want "no," "no!" "no." and "no?" to all be
> treated as distinct words instead of multiple instances of the word "no".
>
>>> I then proceeded to the Language Model. The instructions seemed pretty
>>> much the same as for the Prepare Data section, so I moved the
>>> lowercased files from the corpus directory to the lm directory. Is
>>> this the right thing to do?
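Josh's point about punctuation and tokenization can be illustrated with a one-liner. This is only a sketch using a naive sed rule, not what tokenizer.perl actually does (it handles many more cases):

```shell
# Naive punctuation splitter (illustration only). Separating trailing
# punctuation means "no." and "no!" both yield the token "no" instead
# of counting as distinct vocabulary items.
echo 'no. no! no? no,' | sed 's/\([.,!?]\)/ \1/g'
# -> no . no ! no ? no ,
```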
>
> This is an *acceptable* thing to do, but maybe not the best choice. More
> data for language models is always better. When we make the Europarl data
> parallel for a given language pair, we drop mismatched sentences,
> paragraphs, even whole documents that don't have a version in both
> languages. In the Prepare Data section, as you mentioned, we filter out
> long sentences. All of that dropped data on the target side can be useful
> to the language model. That's why a non-paired monolingual .en file is used
> in the example, and is only tokenized and lowercased, not filtered for long
> sentences.
>
>>> I then trained the model and the system crashed with the following
>>> message:
>>>
>>> Executing:
>>> bin/moses-scripts/scripts-20080125-1939/training/phrase-extract/extract
>>> ./model/aligned.0.en ./model/aligned.0.cy
>>> ./model/aligned.grow-diag-final-and ./model/extract.0-0 7 orientation
>>> PhraseExtract v1.3.0, written by Philipp Koehn
>>> phrase extraction from an aligned parallel corpus
>>> (also extracting orientation)
>>> Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
>>> cat: ./model/extract.0-0.o.part*: No such file or directory
>>> Exit code: 1
>>> Died at
>>> bin/moses-scripts/scripts-20080125-1939/training/train-factored-phrase-model.perl
>>> line 899.
>>>
>>> So, my question is: am I giving Moses the wrong data to work with?
>
> I think it's more likely that some file is misplaced (you say you 'moved'
> the lowercased files to the lm directory - did you copy them or delete
> them?) or that some part of the train-factored-phrase-model.perl process
> isn't running correctly. The full stdout/stderr of the perl script should
> help you debug what is getting done and what is failing. The "Executing:"
> lines are just copies of what is sent to the command line, so you can
> always try copying and pasting one and running it yourself outside of the
> perl script to debug what's going wrong.
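As a sketch of that debugging suggestion: re-running the failing `cat` step by hand (here in a deliberately empty stand-in `model/` directory) reproduces the symptom from the log and shows that the real problem is upstream, in the extract step that should have written the `.part` files:

```shell
# Stand-in directory to demonstrate the symptom; in a real run you
# would cd into your working directory instead.
mkdir -p model && cd model
# If the extract binary wrote no output, there are no part files:
ls extract.0-0.o.part* 2>/dev/null \
  || echo "no part files: the extract step wrote no output"
# Re-running the logged command then fails exactly as in the crash:
cat extract.0-0.o.part* > extract.0-0.o 2>/dev/null \
  || echo "cat failed, just as in the log"
```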
> You've got the perl script, too, so poke around inside it and figure out
> what it's doing. That's the beauty of open source. :)
>
>>> In order to find out, I downloaded Europarl from
>>> http://www.statmt.org/europarl/. It contained version 2 rather than
>>> version 3, but I thought nevertheless that I might try using it. I ran
>>> sentence-align-corpus.perl:
>
> The downloads from that page contain version 3, not v2. What made you think
> it was version 2? Maybe we missed a readme somewhere, but the data is v3
> for sure.
>
>>> ./sentence-align-corpus.perl en de
>>>
>>> but it exited with the following message:
>>>
>>> Died at ./sentence-align-corpus.perl line 16.
>>>
>>> sentence-align-corpus.perl line 16 says:
>>> die unless -e "$dir/$l1";
>
> Yeah, there was a bug in sentence-align-corpus. Line 9 should read
>
> my $dir = "txt";
>
> It was looking in the wrong directory. You can either fix your version or
> re-download the tools.tgz file from the Europarl page.
>
>>> Should I continue with Europarl 2, or is it possible to download
>>> Europarl 3 from somewhere?
>
> See above. v3 is what is available. v2 is available on an archive page at
> <http://www.statmt.org/europarl/archives.html>.
>
>>> Alternatively, would it be possible for you to explain the difference
>>> in purpose and format between wmt07/training/europarl-v3.fr-en.fr and
>>> wmt07/training/europarl-v3.en?
>
> You can get the files that tutorial is talking about from
> <http://www.statmt.org/wmt07/shared-task.html#download> and look through
> them yourself. The europarl-v3.fr-en.* files come in a pair: there should
> be europarl-v3.fr-en.en and europarl-v3.fr-en.fr. All three files have one
> sentence per line, europarl-v3.fr-en.en and europarl-v3.fr-en.fr have an
> identical number of lines, and europarl-v3.en is a superset of the
> europarl-v3.fr-en.en data.
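The "identical number of lines" property is easy to verify. A minimal sketch with stand-in two-line files (substitute the real europarl-v3.fr-en.en / europarl-v3.fr-en.fr paths):

```shell
# Stand-in paired corpus: line N of each file must be the same
# sentence in the two languages.
printf 'hello\ngoodbye\n'     > corpus.en
printf 'bonjour\nau revoir\n' > corpus.fr
# Paired training files must have identical line counts:
if [ "$(wc -l < corpus.en)" -eq "$(wc -l < corpus.fr)" ]; then
  echo "line counts match"
else
  echo "MISMATCH: files are not sentence-aligned"
fi
```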
> Expanding on what I said about LM data above, more data can go into the
> non-paired file because we don't have to match documents across two
> languages. We need paired data for word alignments, but any monolingual
> target data is useful for language modeling.
>
>>> Just to clarify: am I correct in saying that the Prepare Data section
>>> is about training the translation model, i.e. word and phrase
>>> alignments, and the Language Model section is about creating a language
>>> model for the language we're translating to?
>
> Correct.
>
>>> Does the Prepare Data section start with two plain text parallel
>>> corpora with sentences on each line, or is it something more elaborate
>>> than that? Maybe wmt07/training/europarl-v3.fr-en.fr is a plain text
>>> file with French sentence 1 followed by English sentence 1 followed by
>>> French sentence 2 followed by English sentence 2, etc.? I could then
>>> adapt the Welsh-English corpus I'm using accordingly.
>
> These paired files should have exactly the same number of lines. Line 1 in
> .en and line 1 in .fr should be the same sentence, one file in English and
> one in French. These are the results of running sentence-align-corpus,
> combining all the files for each language, and filtering out the lines
> with XML tags. If you want to play with prepared files and not "roll your
> own" from the Europarl data, check out the wmt07 and wmt08 websites for
> downloadable monolingual and parallel training data.
>
>>> Otherwise, is there a problem with the software/implementation on a
>>> Mac system? Would you recommend that I try the recently released
>>> version of Moses? Is there some way to install the new version of
>>> Moses without uninstalling the other one? (I'm wondering about
>>> environment variables.)
>
> I've run the decoder on my Mac laptop just fine. You may have to change a
> few scripts for training - for example, I know the Mac uses 'gzcat'
> instead of 'zcat'. Moses doesn't use environment variables.
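One way to smooth over that gzcat/zcat difference without editing every script is a tiny wrapper placed earlier in PATH. This is only a sketch (the `~/bin` location is an assumption, not part of Moses), useful on a Mac where `gzcat` is the gzip-aware tool:

```shell
# On OS X, plain 'zcat' expects a .Z suffix; 'gzcat' reads .gz files.
# A one-line wrapper lets scripts that call 'zcat' work unchanged.
bindir="$HOME/bin"                     # assumption: any PATH dir works
mkdir -p "$bindir"
printf '#!/bin/sh\nexec gzcat "$@"\n' > "$bindir/zcat"
chmod +x "$bindir/zcat"
export PATH="$bindir:$PATH"            # wrapper now shadows /usr/bin/zcat
```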
> Compile it in a different directory and you've got a second copy!
>
> Good luck!
>
> Josh
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
