Hi Llio,

You've got a lot of questions spread around in this message. I'll try to get to most of them.
>> Dear Moses Group,
>>
>> I am having difficulties running the Moses software (not the recently
>> released version), following the guidelines at
>> http://www.statmt.org/wmt07/baseline.html and I attach a record of the
>> final part of the terminal session for your information.
>>
>> I started with parallel input files, with each line containing one
>> sentence, both already tokenised, tab delimited, and in ASCII (is
>> UTF-8 better?)

Moses itself is encoding-agnostic: use whatever encoding you want. Some of the support scripts on statmt.org (tokenizer.perl, for example) are geared to work better with UTF-8. I find UTF-8 a lot easier to use, especially when you start dealing with multiple language pairs with different native encodings.

>> I followed the instructions under the Prepare Data heading. I briefly
>> inspected the .tok output files, and preferred the original tokenised
>> version, e.g. reference numbers with / were not split up. So, I
>> renamed the original input files as .tok files, filtered out long
>> sentences and lowercased the training data.

I think you're saying you didn't like the behavior of our sample tokenizer with regard to some feature in the training data. If your original files are already tokenized in some way, you can just use that data instead of re-applying tokenization. Some form of tokenization is definitely important, though: you don't want "no," "no!" "no." and "no?" to all be treated as distinct words instead of as multiple instances of the word "no".

>> I then proceeded to the Language Model. The instructions seemed
>> pretty much the same as for the Prepare Data section, so I moved the
>> lowercased files from the corpus directory to the lm directory. Is
>> this the right thing to do?

This is an *acceptable* thing to do, but maybe not the best choice. More data for language models is always better.
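Going back to the tokenization point for a second, here's a toy shell sketch of why it matters. This is just an illustration, not what tokenizer.perl actually does (the real script handles abbreviations, URLs, and so on):

```shell
#!/bin/sh
# Toy stand-in for real tokenization: split punctuation off words,
# then lowercase (as the baseline recipe does). After this, "No,"
# "no!" "No." and "no?" all contribute counts to the single word "no".
printf '%s\n' 'No, no! No. No?' |
  sed 's/\([.,!?]\)/ \1/g' |     # put a space before each punctuation mark
  tr '[:upper:]' '[:lower:]'     # lowercase
# prints: no , no ! no . no ?
```

Without that split, the language model and the phrase table would each see four distinct "words", and every count would be that much sparser.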
When we make the Europarl data parallel for a given language pair, we drop mismatched sentences, paragraphs, even whole documents that don't have a version in both languages. In the Prepare Data section, as you mentioned, we filter out long sentences. All of that dropped data on the target side can still be useful to the language model. That's why a non-paired monolingual .en file is used in the example, and why it is only tokenized and lowercased, not filtered for long sentences.

>> I then trained the model and the system crashed with the following
>> message:
>>
>> Executing: bin/moses-scripts/scripts-20080125-1939/training/phrase-extract/extract
>> ./model/aligned.0.en ./model/aligned.0.cy
>> ./model/aligned.grow-diag-final-and ./model/extract.0-0 7 orientation
>> PhraseExtract v1.3.0, written by Philipp Koehn
>> phrase extraction from an aligned parallel corpus
>> (also extracting orientation)
>> Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
>> cat: ./model/extract.0-0.o.part*: No such file or directory
>> Exit code: 1
>> Died at bin/moses-scripts/scripts-20080125-1939/training/train-factored-phrase-model.perl
>> line 899.
>>
>> So, my question is: am I giving Moses the wrong data to work with?

I think it's more likely that some file is misplaced (you say you 'moved' the lowercased files to the lm directory - did you copy them or delete them?) or that some part of the train-factored-phrase-model.perl process isn't running correctly. The full stdout/stderr of the perl script should help you debug what is getting done and what is failing. The "Executing:" lines are just copies of what is sent to the command line, so you can always copy and paste one of them and run it yourself outside of the perl script to see what's going wrong. You've got the perl script, too, so poke around inside it and figure out what it's doing. That's the beauty of open source. :)

>> In order to find out, I downloaded europarl from
>> http://www.statmt.org/europarl/.
>> It contained version 2 rather than version 3 but I thought
>> nevertheless that I might try using it. I ran
>> sentence-align-corpus.perl:
>>
>> ./sentence-align-corpus.perl en de
>>
>> but it exited with the following message:
>>
>> Died at ./sentence-align-corpus.perl line 16.
>>
>> sentence-align-corpus.perl line 16 says:
>>
>> die unless -e "$dir/$l1";

The downloads from that page contain version 3, not version 2. What made you think it was version 2? Maybe we missed a readme somewhere, but the data is v3 for sure.

As for the crash: yeah, there was a bug in sentence-align-corpus.perl. Line 9 should read

  my $dir = "txt";

It was looking in the wrong directory. You can either fix your copy or re-download the tools.tgz file from the Europarl page.

>> Should I continue with europarl 2 or is it possible to download
>> europarl 3 from somewhere?

See above: v3 is what is available now. v2 is available on an archive page at <http://www.statmt.org/europarl/archives.html>.

>> Alternatively would it be possible for you to explain the difference
>> in purpose and format between wmt07/training/europarl-v3.fr-en.fr and
>> wmt07/training/europarl-v3.en?

You can get the files that tutorial is talking about from <http://www.statmt.org/wmt07/shared-task.html#download> and look through them yourself. The europarl-v3.fr-en.* files come in a pair: there should be a europarl-v3.fr-en.en and a europarl-v3.fr-en.fr. All three files have one sentence per line, europarl-v3.fr-en.en and europarl-v3.fr-en.fr have an identical number of lines, and europarl-v3.en is a superset of the europarl-v3.fr-en.en data. Expanding on what I said about LM data above, more data can go into the non-paired file because we don't have to match documents across two languages. We need paired data for word alignments, but any monolingual target data is useful for language modeling.

>> Just to clarify: am I correct in saying that the Prepare Data
>> section is about training the translation model i.e.
>> word and phrase alignments, and the Language Model section is about
>> creating a language model for the language we're translating to?

Correct.

>> Does the Prepare Data section start with two plain text parallel
>> corpora with sentences on each line or is it something more elaborate
>> than that? Maybe the wmt07/training/europarl-v3.fr-en.fr is a plain
>> text file with French sentence 1 followed by English sentence 1
>> followed by French sentence 2 followed by English sentence 2 etc? I
>> could then adapt the Welsh-English corpus I'm using accordingly.

These paired files should have exactly the same number of lines. Line 1 in .en and line 1 in .fr should be the same sentence, one file in English and one in French. They are the result of running sentence-align-corpus.perl, combining all the files for each language, and filtering out the lines with XML tags. If you want to play with prepared files and not "roll your own" from the Europarl data, check out the wmt07 and wmt08 websites for downloadable monolingual and parallel training data.

>> Otherwise, is there a problem with the software/implementation on a
>> Mac system? Would you recommend that I try the recently released
>> version of Moses? Is there some way to install the new version of
>> Moses without uninstalling the other one (I'm wondering about
>> environment variables)?

I've run the decoder on my Mac laptop just fine. You may have to change a few scripts for training; for example, I know the Mac uses 'gzcat' instead of 'zcat'. Moses doesn't use environment variables. Compile it in a different directory and you've got a second copy!

Good luck!

Josh

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
