Dear Josh,

Many thanks for answering all these questions. I will try your suggestions.

Regards,
Llio
On Fri, Jul 25, 2008 at 12:08 PM, Josh Schroeder <[EMAIL PROTECTED]> wrote:
> Hi Llio,
>
> You've got a lot of questions spread around in this message. I'll try to
> get to most of them.
>
>>> Dear Moses Group,
>>>
>>> I am having difficulties running the Moses software (not the recently
>>> released version), following the guidelines at
>>> http://www.statmt.org/wmt07/baseline.html and I attach a record of the
>>> final part of the terminal session for your information.
>>>
>>> I started with parallel input files, with each line containing one
>>> sentence, both already tokenised, tab delimited, and in ASCII (is
>>> UTF-8 better?)
>
> Moses itself is encoding-agnostic - use whatever encoding you want. Some of
> the support scripts on statmt.org (tokenizer.perl, for example) are geared
> to work better with UTF-8. I find UTF-8 a lot easier to use -- especially
> when you start dealing with multiple language pairs with different native
> encodings.
>
>>> I followed the instructions under the Prepare Data heading. I briefly
>>> inspected the .tok output files, and preferred the original tokenised
>>> version, e.g. reference numbers with / were not split up. So, I
>>> renamed the original input files as .tok files, filtered out long
>>> sentences and lowercased the training data.
>
> I think you're saying you didn't like the behavior of our sample tokenizer
> with regard to some feature in the training data. If your original files
> are already tokenized in some way, you can just use that data instead of
> re-applying tokenization. Some form of tokenization is definitely
> important, though: you don't want "no," "no!" "no." and "no?" to all be
> treated as distinct words instead of multiple instances of the word "no".
>
>>> I then proceeded to the Language Model. The instructions seemed pretty
>>> much the same as for the Prepare Data section, so I moved the
>>> lowercased files from the corpus directory to the lm directory. Is
>>> this the right thing to do?
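Josh's point about punctuation and tokenization can be illustrated with a one-liner. This is only a sketch using a naive sed rule, not what tokenizer.perl actually does (it handles many more cases):

```shell
# Naive punctuation splitter (illustration only). Separating trailing
# punctuation means "no." and "no!" both yield the token "no" instead
# of counting as distinct vocabulary items.
echo 'no. no! no? no,' | sed 's/\([.,!?]\)/ \1/g'
# -> no . no ! no ? no ,
```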
>
> This is an *acceptable* thing to do, but maybe not the best choice. More
> data for language models is always better. When we make the Europarl data
> parallel for a given language pair, we drop mismatched sentences,
> paragraphs, even whole documents that don't have a version in both
> languages. In the Prepare Data section, as you mentioned, we filter out
> long sentences. All of that dropped data on the target side can be useful
> to the language model. That's why a non-paired monolingual .en file is used
> in the example, and is only tokenized and lowercased, not filtered for long
> sentences.
>
>>> I then trained the model and the system crashed with the following
>>> message:
>>>
>>> Executing:
>>> bin/moses-scripts/scripts-20080125-1939/training/phrase-extract/extract
>>> ./model/aligned.0.en ./model/aligned.0.cy
>>> ./model/aligned.grow-diag-final-and ./model/extract.0-0 7 orientation
>>> PhraseExtract v1.3.0, written by Philipp Koehn
>>> phrase extraction from an aligned parallel corpus
>>> (also extracting orientation)
>>> Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
>>> cat: ./model/extract.0-0.o.part*: No such file or directory
>>> Exit code: 1
>>> Died at
>>> bin/moses-scripts/scripts-20080125-1939/training/train-factored-phrase-model.perl
>>> line 899.
>>>
>>> So, my question is: am I giving Moses the wrong data to work with?
>
> I think it's more likely that some file is misplaced (you say you 'moved'
> the lowercased files to the lm directory - did you copy them or delete
> them?) or that some part of the train-factored-phrase-model.perl process
> isn't running correctly. The full stdout/stderr of the perl script should
> help you debug what is getting done and what is failing. The "Executing:"
> lines are just copies of what is sent to the command line, so you can
> always try copying and pasting one and running it yourself outside of the
> perl script to debug what's going wrong.
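As a sketch of that debugging suggestion: re-running the failing `cat` step by hand (here in a deliberately empty stand-in `model/` directory) reproduces the symptom from the log and shows that the real problem is upstream, in the extract step that should have written the `.part` files:

```shell
# Stand-in directory to demonstrate the symptom; in a real run you
# would cd into your working directory instead.
mkdir -p model && cd model
# If the extract binary wrote no output, there are no part files:
ls extract.0-0.o.part* 2>/dev/null \
  || echo "no part files: the extract step wrote no output"
# Re-running the logged command then fails exactly as in the crash:
cat extract.0-0.o.part* > extract.0-0.o 2>/dev/null \
  || echo "cat failed, just as in the log"
```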
> You've got the perl script, too, so poke around inside it and figure out
> what it's doing. That's the beauty of open source. :)
>
>>> In order to find out, I downloaded Europarl from
>>> http://www.statmt.org/europarl/. It contained version 2 rather than
>>> version 3, but I thought nevertheless that I might try using it. I ran
>>> sentence-align-corpus.perl:
>
> The downloads from that page contain version 3, not v2. What made you think
> it was version 2? Maybe we missed a readme somewhere, but the data is v3
> for sure.
>
>>> ./sentence-align-corpus.perl en de
>>>
>>> but it exited with the following message:
>>>
>>> Died at ./sentence-align-corpus.perl line 16.
>>>
>>> sentence-align-corpus.perl line 16 says:
>>> die unless -e "$dir/$l1";
>
> Yeah, there was a bug in sentence-align-corpus. Line 9 should read
>
> my $dir = "txt";
>
> It was looking in the wrong directory. You can either fix your version or
> re-download the tools.tgz file from the Europarl page.
>
>>> Should I continue with Europarl 2, or is it possible to download
>>> Europarl 3 from somewhere?
>
> See above. v3 is what is available. v2 is available on an archive page at
> <http://www.statmt.org/europarl/archives.html>.
>
>>> Alternatively, would it be possible for you to explain the difference
>>> in purpose and format between wmt07/training/europarl-v3.fr-en.fr and
>>> wmt07/training/europarl-v3.en?
>
> You can get the files that tutorial is talking about from
> <http://www.statmt.org/wmt07/shared-task.html#download> and look through
> them yourself. The europarl-v3.fr-en.* files come in a pair: there should
> be europarl-v3.fr-en.en and europarl-v3.fr-en.fr. All three files have one
> sentence per line, europarl-v3.fr-en.en and europarl-v3.fr-en.fr have an
> identical number of lines, and europarl-v3.en is a superset of the
> europarl-v3.fr-en.en data.
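The "identical number of lines" property is easy to verify. A minimal sketch with stand-in two-line files (substitute the real europarl-v3.fr-en.en / europarl-v3.fr-en.fr paths):

```shell
# Stand-in paired corpus: line N of each file must be the same
# sentence in the two languages.
printf 'hello\ngoodbye\n'     > corpus.en
printf 'bonjour\nau revoir\n' > corpus.fr
# Paired training files must have identical line counts:
if [ "$(wc -l < corpus.en)" -eq "$(wc -l < corpus.fr)" ]; then
  echo "line counts match"
else
  echo "MISMATCH: files are not sentence-aligned"
fi
```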
> Expanding on what I said about LM data above, more data can go into the
> non-paired file because we don't have to match documents across two
> languages. We need paired data for word alignments, but any monolingual
> target data is useful for language modeling.
>
>>> Just to clarify: am I correct in saying that the Prepare Data section
>>> is about training the translation model, i.e. word and phrase
>>> alignments, and the Language Model section is about creating a language
>>> model for the language we're translating to?
>
> Correct.
>
>>> Does the Prepare Data section start with two plain text parallel
>>> corpora with sentences on each line, or is it something more elaborate
>>> than that? Maybe wmt07/training/europarl-v3.fr-en.fr is a plain text
>>> file with French sentence 1 followed by English sentence 1 followed by
>>> French sentence 2 followed by English sentence 2, etc.? I could then
>>> adapt the Welsh-English corpus I'm using accordingly.
>
> These paired files should have exactly the same number of lines. Line 1 in
> .en and line 1 in .fr should be the same sentence, one file in English and
> one in French. These are the results of running sentence-align-corpus,
> combining all the files for each language, and filtering out the lines
> with XML tags. If you want to play with prepared files and not "roll your
> own" from the Europarl data, check out the wmt07 and wmt08 websites for
> downloadable monolingual and parallel training data.
>
>>> Otherwise, is there a problem with the software/implementation on a
>>> Mac system? Would you recommend that I try the recently released
>>> version of Moses? Is there some way to install the new version of
>>> Moses without uninstalling the other one? (I'm wondering about
>>> environment variables.)
>
> I've run the decoder on my Mac laptop just fine. You may have to change a
> few scripts for training - for example, I know the Mac uses 'gzcat'
> instead of 'zcat'. Moses doesn't use environment variables.
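One way to smooth over that gzcat/zcat difference without editing every script is a tiny wrapper placed earlier in PATH. This is only a sketch (the `~/bin` location is an assumption, not part of Moses), useful on a Mac where `gzcat` is the gzip-aware tool:

```shell
# On OS X, plain 'zcat' expects a .Z suffix; 'gzcat' reads .gz files.
# A one-line wrapper lets scripts that call 'zcat' work unchanged.
bindir="$HOME/bin"                     # assumption: any PATH dir works
mkdir -p "$bindir"
printf '#!/bin/sh\nexec gzcat "$@"\n' > "$bindir/zcat"
chmod +x "$bindir/zcat"
export PATH="$bindir:$PATH"            # wrapper now shadows /usr/bin/zcat
```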
> Compile it in a different directory and you've got a second copy!
>
> Good luck!
>
> Josh
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
