Dear Hieu,

This is most useful. Thank you very much for the lead. Do you know which GIZA++ program I need to amend? I take it that the file should not be overwritten. Is the filename always the same, or does it depend on the input I give the system?

Many thanks,
Llio Humphreys

On Fri, Aug 1, 2008 at 5:07 PM, Hieu Hoang <[EMAIL PROTECTED]> wrote:
> this may be a similar problem that was encountered by the UPV guys when
> running under cygwin
>
> the Mac filesystem is case INSENSITIVE.
> http://docs.info.apple.com/article.html?artnum=107863
> however, giza++ creates 2 files which have the same name but just different
> cases, eg
> blah.a3.final
> blah.A3.final
> 1 overwrites the other.
>
> you need to change the giza++ code, or run under a case-sensitive
> filesystem. ideally, it should be changed in the trunk giza++ code
>
> -----Original Message-----
> From: Josh Schroeder [mailto:[EMAIL PROTECTED]
> Sent: 01 August 2008 16:56
> To: Hieu Hoang
> Subject: Fwd: [Moses-support] Moses: Prepare Data, Build Language Model and
> Train Model
>
> Begin forwarded message:
>
>> From: "Llio Humphreys" <[EMAIL PROTECTED]>
>> Date: 25 July 2008 10:00:00 BST
>> To: moses-support <[email protected]>
>> Subject: [Moses-support] Moses: Prepare Data, Build Language Model and
>> Train Model
>>
>> Please see message without attachment. Thank you, Llio Humphreys
>>
>> On Fri, Jul 25, 2008 at 9:50 AM, Llio Humphreys
>> <[EMAIL PROTECTED]> wrote:
>>> Dear Moses Group,
>>>
>>> I am having difficulties running the Moses software (not the recently
>>> released version), following the guidelines at
>>> http://www.statmt.org/wmt07/baseline.html and I attach a record of
>>> the final part of the terminal session for your information.
>>>
>>> I started with parallel input files, with each line containing one
>>> sentence, both already tokenised, tab delimited, and in ASCII (is
>>> UTF-8 better?)
>>>
>>> I followed the instructions under the Prepare Data heading. I
>>> briefly inspected the .tok output files, and preferred the original
>>> tokenised version e.g. reference numbers with / were not split up.
>>> So, I renamed the original input files as .tok files, filtered out
>>> long sentences and lowercased the training data.
>>>
>>> I then proceeded to the Language Model. The instructions seemed
>>> pretty much the same as for the Prepare Data section, so I moved the
>>> lowercased files from the corpus directory to the lm directory. Is
>>> this the right thing to do?
>>>
>>> I then trained the model and the system crashed with the following
>>> message:-
>>>
>>> Executing: bin/moses-scripts/scripts-20080125-1939/training/phrase-extract/extract
>>> ./model/aligned.0.en ./model/aligned.0.cy
>>> ./model/aligned.grow-diag-final-and ./model/extract.0-0 7 orientation
>>> PhraseExtract v1.3.0, written by Philipp Koehn
>>> phrase extraction from an aligned parallel corpus (also extracting orientation)
>>> Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
>>> cat: ./model/extract.0-0.o.part*: No such file or directory
>>> Exit code: 1
>>> Died at
>>> bin/moses-scripts/scripts-20080125-1939/training/train-factored-phrase-model.perl
>>> line 899.
>>>
>>> So, my question is: am I giving Moses the wrong data to work with?
>>>
>>> In order to find out, I downloaded europarl from
>>> http://www.statmt.org/europarl/. It contained version 2 rather than
>>> version 3 but I thought nevertheless that I might try using it. I
>>> ran sentence-align-corpus.perl:
>>>
>>> ./sentence-align-corpus.perl en de
>>>
>>> but it exited with the following message:
>>>
>>> Died at ./sentence-align-corpus.perl line 16.
>>>
>>> sentence-align-corpus.perl line 16 says:
>>> die unless -e "$dir/$l1";
>>>
>>> Should I continue with europarl 2 or is it possible to download
>>> europarl 3 from somewhere?
>>>
>>> Alternatively would it be possible for you to explain the difference
>>> in purpose and format between wmt07/training/europarl-v3.fr-en.fr and
>>> wmt07/training/europarl-v3.en? Just to clarify: am I correct in
>>> saying that the Prepare Data section is about training the
>>> translation model, i.e. word and phrase alignments, and the Language
>>> Model section is about creating a language model for the language
>>> we're translating to?
>>> Does the Prepare Data section start with two plain text parallel
>>> corpora with sentences on each line or is it something more elaborate
>>> than that? Maybe the wmt07/training/europarl-v3.fr-en.fr is a plain
>>> text file with French sentence 1 followed by English sentence 1
>>> followed by French sentence 2 followed by English sentence 2 etc? I
>>> could then adapt the Welsh-English corpus I'm using accordingly.
>>>
>>> Otherwise, is there a problem with the software/implementation on a
>>> Mac system? Would you recommend that I try the recently released
>>> version of Moses? Is there some way to install the new version of
>>> Moses without uninstalling the other one (I'm wondering about
>>> environment variables)
>>>
>>> Thank you,
>>> Llio Humphreys
>>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>
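
The case-collision problem described in Hieu's reply above (blah.a3.final and blah.A3.final landing on the same file) can be checked before any GIZA++ code is changed. The following is only a minimal sketch in Python, not part of Moses or GIZA++: the function names find_case_collisions and is_case_insensitive are hypothetical, and the example filenames are the ones quoted in the thread. The first function lists names that differ only by letter case; the second probes whether a given directory sits on a case-insensitive filesystem.

```python
import os
import tempfile
from collections import defaultdict


def find_case_collisions(filenames):
    """Return groups of names that differ only by letter case."""
    groups = defaultdict(list)
    for name in filenames:
        groups[name.lower()].append(name)
    # Keep only groups with at least two distinct spellings.
    return [sorted(names) for names in groups.values() if len(set(names)) > 1]


def is_case_insensitive(directory):
    """Probe a directory: create a lowercase-named temporary file,
    then check whether the uppercase spelling resolves to it."""
    with tempfile.NamedTemporaryFile(prefix="casetest_", dir=directory) as handle:
        upper_name = os.path.basename(handle.name).upper()
        return os.path.exists(os.path.join(directory, upper_name))


if __name__ == "__main__":
    # Filenames taken from the example in the thread, plus one unrelated name.
    names = ["blah.a3.final", "blah.A3.final", "blah.t3.final"]
    print(find_case_collisions(names))
    # The system temp directory is used here only as a writable stand-in;
    # in practice one would pass the training directory instead.
    print(is_case_insensitive(tempfile.gettempdir()))
```

If the probe returns True for the training directory, any pair reported by find_case_collisions would overwrite each other there, which matches the behaviour Hieu describes.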
