Hi Llio,

You've got a lot of questions spread around in this message. I'll try  
to get to most of them.

>>
>> Dear Moses Group,
>>
>> I am having difficulties running the Moses software (not the recently
>> released version), following the guidelines at
>> http://www.statmt.org/wmt07/baseline.html and I attach a record of
>> the final part of the terminal session for your information.
>>
>> I started with parallel input files, with each line containing one
>> sentence, both already tokenised, tab delimited, and in ASCII (is
>> UTF-8 better?)

Moses itself is encoding-agnostic - use whatever encoding you want.  
Some of the support scripts on statmt.org (tokenizer.perl, for  
example) are geared to work better with UTF-8.  I find UTF-8 a lot  
easier to use -- especially when you start dealing with multiple  
language pairs with different native encodings.
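If your corpus is in a legacy encoding, converting it to UTF-8 up front is cheap. A sketch with iconv (the file names here are made up for the demo):

```shell
# Convert a Latin-1 file to UTF-8 (demo with a throwaway file).
printf 'caf\351\n' > demo.latin1.txt            # byte 0xE9 is "é" in Latin-1
iconv -f ISO-8859-1 -t UTF-8 demo.latin1.txt > demo.utf8.txt
cat demo.utf8.txt
```

Run that over each side of the corpus once, before tokenizing, and everything downstream sees one consistent encoding.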

>> I followed the instructions under the Prepare Data heading.  I briefly
>> inspected the .tok output files, and preferred the original tokenised
>> version e.g. reference numbers with / were not split up.  So, I
>> renamed the original input files as .tok files, filtered out long
>> sentences and lowercased the training data.

I think you're saying you didn't like the behavior of our sample  
tokenizer with regard to some feature in the training data. If your  
original files are already tokenized in some way, you can just use  
that data instead of re-applying tokenization. Some form of  
tokenization is definitely important, though: you don't want "no,"  
"no!", "no.", and "no?" all treated as distinct words instead of as  
multiple instances of the word "no".
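Just to illustrate the idea (this is NOT tokenizer.perl, which handles many more cases - it's only a crude sketch), splitting sentence punctuation off words can be as simple as:

```shell
# Crude punctuation splitting: insert a space before , . ! ?
# The real tokenizer also handles abbreviations, quotes, etc.
echo 'no, no! no. no?' | sed 's/\([,.!?]\)/ \1/g'
```

After a pass like that, all four tokens of "no" count as the same word when the models are estimated.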

>> I then proceeded to the Language Model. The instructions seemed
>> pretty much the same as for the Prepare Data section, so I moved the
>> lowercased files from the corpus directory to the lm directory. Is
>> this the right thing to do?

This is an *acceptable* thing to do, but maybe not the best choice.  
More data for language models is always better. When we make the  
Europarl data parallel for a given language pair, we drop mismatched  
sentences, paragraphs, even whole documents that don't have a version  
in both languages. In the Prepare Data section, as you mentioned, we  
also filter out long sentences. All of that dropped target-side data  
can still be useful to the language model. That's why a non-paired  
monolingual .en file is used in the example, and is only tokenized and  
lowercased, not filtered for long sentences.
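As a sketch of that monolingual LM preprocessing (file names invented for the demo; the baseline uses lowercase.perl, but plain tr shows the idea for ASCII text):

```shell
# Monolingual LM data: tokenization assumed done, then lowercase.
# Note: no length filtering here, unlike the parallel training data.
printf 'The House Rose At Noon .\n' > mono.tok.en   # toy stand-in corpus
tr '[:upper:]' '[:lower:]' < mono.tok.en > mono.lowercased.en
cat mono.lowercased.en
```

The lowercased file is then what you hand to the LM toolkit; any extra target-language text you can find can be concatenated in first.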

>> I then trained the model and the system crashed with the following message:-
>>
>> Executing: bin/moses-scripts/scripts-20080125-1939/training/phrase-extract/extract
>> ./model/aligned.0.en ./model/aligned.0.cy
>> ./model/aligned.grow-diag-final-and ./model/extract.0-0 7 orientation
>> PhraseExtract v1.3.0, written by Philipp Koehn
>> phrase extraction from an aligned parallel corpus
>> (also extracting orientation)
>> Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
>> cat: ./model/extract.0-0.o.part*: No such file or directory
>> Exit code: 1
>> Died at bin/moses-scripts/scripts-20080125-1939/training/train-factored-phrase-model.perl
>> line 899.
>>
>> So, my question is: am I giving Moses the wrong data to work with?

I think it's more likely that some file is misplaced (you say you  
'moved' the lowercased files to the lm directory - did you copy them,  
or delete the originals?) or that some part of the  
train-factored-phrase-model.perl process isn't running correctly. The  
full stdout/stderr of the perl script should help you debug what is  
getting done and what is failing. The "Executing:" lines are just  
copies of what is sent to the command line, so you can always try  
copying and pasting one and running it yourself outside of the perl  
script to see what's going wrong. You've got the perl script, too, so  
poke around inside it and figure out what it's doing. That's the  
beauty of open source. :)
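For what it's worth, the exact error in your log is easy to reproduce by hand: if the extract step wrote no .part files, the cat over the glob fails just like this (demo in a throwaway directory):

```shell
# Reproduce the failure mode: cat over a glob that matches nothing.
# The shell passes the unmatched pattern through literally, and cat
# fails with "No such file or directory" and exit code 1.
mkdir -p demo_model
cat demo_model/extract.0-0.o.part* > demo_model/extract.0-0.o
echo "Exit code: $?"
```

So the real failure is upstream: the extract binary never wrote its .part output, and the script only dies when cat finds nothing to merge. Look earlier in the log for why extract produced nothing.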

>> In order to find out, I downloaded europarl from
>> http://www.statmt.org/europarl/.  It contained version 2 rather than
>> version 3 but I thought nevertheless that I might try using it.  I ran
>> sentence-align-corpus.perl:

The downloads from that page contain version 3, not v2. What made you  
think it was version 2? Maybe we missed a readme somewhere, but the  
data is v3 for sure.

>> ./sentence-align-corpus.perl en de
>>
>> , but it exited with the following message:
>>
>> Died at ./sentence-align-corpus.perl line 16.
>>
>> sentence-align-corpus.perl line 16 says:
>> die unless -e "$dir/$l1";

Yeah, there was a bug in sentence-align-corpus. Line 9 should read

my $dir = "txt";

It was looking in the wrong directory. You can either fix your version  
or re-download the tools.tgz file from the Europarl page.

>> Should I continue with europarl 2 or is it possible to download
>> europarl 3 from somewhere?

See above - v3 is what's available from the main page. v2 is still  
available on an archive page at <http://www.statmt.org/europarl/archives.html>

>> Alternatively would it be possible for you to explain the difference
>> in purpose and format between wmt07/training/europarl-v3.fr-en.fr and
>> wmt07/training/europarl-v3.en?

You can get the files that tutorial is talking about from  
<http://www.statmt.org/wmt07/shared-task.html#download> and look  
through them yourself. The europarl-v3.fr-en.* files come in a pair:  
there should be europarl-v3.fr-en.en and europarl-v3.fr-en.fr. All  
three files have one sentence per line, europarl-v3.fr-en.en and  
europarl-v3.fr-en.fr have an identical number of lines, and  
europarl-v3.en is a superset of the europarl-v3.fr-en.en data.  
Expanding on what I said about LM data above, more data can go into  
the non-paired file because we don't have to match documents across  
two languages. We need paired data for word alignments, but any  
monolingual target-language data is useful for language modeling.

>> Just to clarify: am I correct in
>> saying that the Prepare Data section is about training the translation
>> model i.e. word and phrase alignments, and Language model section is
>> about creating a language model for the language we're translating to?

Correct.

>> Does the Prepare Data section start with two plain text parallel
>> corpora with sentences on each line or  is something more elaborate
>> than that?  Maybe the wmt07/training/europarl-v3.fr-en.fr is a plain
>> text file with French sentence 1 followed by English sentence 1
>> followed by French sentence 2 followed by English sentence 2 etc?  I
>> could then adapt the Welsh-English corpus I'm using accordingly.

These paired files should have exactly the same number of lines. Line  
1 in .en and line 1 in .fr should be the same sentence, one file in  
English and one in French. These are the results of running  
sentence-align-corpus, combining all the files for each language, and  
filtering out the lines with XML tags. If you want to play with  
prepared files and not "roll your own" from the Europarl data, check  
out the wmt07 and wmt08 websites for downloadable monolingual and  
parallel training data.
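A quick sanity check on any pair you build yourself (toy files invented for the demo): the line counts must match, and paste lets you eyeball the alignment:

```shell
# Build a tiny toy parallel pair, then check it.
printf 'hello .\ngoodbye .\n' > toy.en
printf 'bonjour .\nau revoir .\n' > toy.fr
wc -l toy.en toy.fr    # both files must report the same count
paste toy.en toy.fr    # shows line N of each file side by side
```

If the counts ever differ, the alignment is broken somewhere upstream and training will fail or, worse, quietly learn garbage.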

>> Otherwise, is there a problem with the software/implementation on a
>> Mac system? Would you recommend that I try the recently released
>> version of Moses?  Is there some way to install the new version of
>> Moses without uninstalling the other one (I'm wondering about
>> environment variables)

I've run the decoder on my Mac laptop just fine. You may have to  
change a few scripts for training - for example, I know the Mac uses  
'gzcat' instead of 'zcat'. Moses doesn't use environment variables, so  
the two versions won't interfere: compile the new one in a different  
directory and you've got a second copy!
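One portable workaround for the gzcat/zcat difference (you'd still have to edit any script that hard-codes zcat, so this is just a sketch): 'gunzip -c' decompresses to stdout under the same name on both Linux and the Mac.

```shell
# 'gunzip -c' works on both Linux and the Mac, sidestepping
# the zcat-vs-gzcat naming difference.
printf 'hello\n' | gzip > demo.txt.gz
gunzip -c demo.txt.gz
```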


Good luck!

Josh

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
