[Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

Llio Humphreys Thu, 14 Aug 2008 04:03:42 -0700

Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
Thank you all for your help.  It is very, very much appreciated.  I
decided to try Eric's packages, and it looks like the installation
worked.  I typed some of the commands in the Baseline instructions
without arguments, and each program either reported that I had missed
some arguments or printed a description of itself.  Thank you Eric!!!


 Following the Baseline instructions
 (http://www.statmt.org/wmt08/baseline.html) I have now got to the
 following step:

 Use SRILM to build language model:
 /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
 -text working-dir/lm/europarl.lowercased -lm
 working-dir/lm/europarl.lm

 In my case, I was in folder home/llio/MOSESMTDATA.  I didn't know the
 path to ngram-count, but it was possible to invoke it without the
 path:

 ngram-count -order 5 -interpolate -kndiscount -text
 europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
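
 A minimal sketch of how the location of ngram-count can be looked up when
 it runs without a path, assuming a POSIX shell.  The helper name
 locate_tool is hypothetical, and where the binary actually lives depends
 on how SRILM was installed:

```shell
# Hypothetical helper: print the full path of a command that the shell
# finds on PATH, or a note if it is not installed.
locate_tool() {
    command -v "$1" || echo "$1: not found on PATH"
}

# Where does the shell find ngram-count?
locate_tool ngram-count
```

 The printed path could then stand in for
 /path-to-srilm/bin/i686/ngram-count in the Baseline commands.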

 I'm concerned about two things:
 1) This ngram-count step is taking a very long time.  I think I started
 it around 6pm yesterday, and it's still going.  It's very
 resource-intensive, and it's difficult to switch to other open windows.
 I went to check on it around 9pm and couldn't find that particular
 terminal.  I thought I had closed that terminal by mistake, so I stupidly
 opened another one and entered the same command.  I subsequently
 found that the original terminal was still open, so I closed the
 second one.  I'm not sure whether issuing this command a second time,
 on the same program and files but from a different terminal, would
 corrupt the original ngram-count step.  Should I start it again, or
 would starting it again make things worse?  I looked up ngram-count
 (http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
 and I don't think it outputs to any file, so I guess you have to be in
 the same terminal to do the next step?  I opened another terminal and
 typed 'top' to see what processes are running, and I know that
 ngram-count is doing something, but whether it's doing well or stuck
 in a loop, I can't say.  What I do find strange is that the time shown
 for ngram-count is 00:58:20, when it's been going for hours.  I
 searched for this problem in previous Moses Group emails, and I
 understand that if I run this with order 4 instead of 5 it will run
 quicker with very similar results?  So, can I just stop what it's
 doing and run this command in the same terminal with order 4?  Are
 there any files I need to 'touch' to ensure that it doesn't leave any
 stone unturned?
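
 A minimal sketch of running a long job so that it does not depend on one
 terminal staying open, assuming a POSIX shell with nohup available.  The
 helper run_detached, the variable DETACHED_PID and the log file name are
 hypothetical.  Two points it relies on: the -lm option of ngram-count
 names the file the model is written to, so the next step reads that file
 rather than anything printed in the terminal; and the TIME column in top
 counts CPU time rather than wall-clock time, which ps can show side by
 side:

```shell
# Hypothetical helper: start a command detached from the terminal,
# send its output to a log file, and remember its process ID.
run_detached() {
    log_file="$1"
    shift
    nohup "$@" > "$log_file" 2>&1 &
    DETACHED_PID=$!
}

# Example with the command from this message (needs SRILM installed):
# run_detached ngram-count.log ngram-count -order 5 -interpolate -kndiscount \
#     -text europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
#
# Compare elapsed wall-clock time with CPU time for the running job:
# ps -o etime=,time= -p "$DETACHED_PID"
```

 Started this way, the job keeps running after the terminal is closed, and
 the log file can be read from any other terminal.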

 2) how to do the next step:

 
bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl
 -scripts-root-dir bin/moses-scripts/scripts-YYYYMMDD-HHMM -root-dir
 working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en
 -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm
 0:5:working-dir/lm/europarl.lm:0

I assume that, like ngram-count, I can just type in
train-factored-phrase-model.perl without the full path.  Do I need to
set the -scripts-root-dir parameter?  Are all the scripts in the same
place?
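
A minimal sketch that assembles the training command from two variables
and prints it for checking before anything is run, assuming a POSIX shell.
SCRIPTS_ROOTDIR is the environment variable mentioned elsewhere in this
thread; the names SCRIPTS_ROOT, WORK_DIR and build_train_cmd are
hypothetical, and the YYYYMMDD-HHMM placeholder still has to be replaced
with the actual scripts directory on the machine:

```shell
# Use SCRIPTS_ROOTDIR if it is set; otherwise fall back to the placeholder
# path from the Baseline instructions (must be edited before real use).
SCRIPTS_ROOT="${SCRIPTS_ROOTDIR:-bin/moses-scripts/scripts-YYYYMMDD-HHMM}"
WORK_DIR="working-dir"

# Hypothetical helper: print the full training command on one line.
build_train_cmd() {
    printf '%s ' \
        "$SCRIPTS_ROOT/training/train-factored-phrase-model.perl" \
        -scripts-root-dir "$SCRIPTS_ROOT" \
        -root-dir "$WORK_DIR" \
        -corpus "$WORK_DIR/corpus/europarl.lowercased" \
        -f fr -e en \
        -alignment grow-diag-final-and \
        -reordering msd-bidirectional-fe \
        -lm "0:5:$WORK_DIR/lm/europarl.lm:0"
    printf '\n'
}

# Dry run: show the command so the paths can be checked by eye.
build_train_cmd
```

Printing the command first makes it easy to see whether every path points
at a file that exists before a long training run is started.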

Thank you,

Llio




 On 8/14/08, Murat ALPEREN <[EMAIL PROTECTED]> wrote:
 > Dear Llio,
 >
 > You should finally be okay with installing Moses if you have installed all
 > the dependent packages beforehand. I am not aware of the 'whereis' command,
 > but once you train your model, the moses.ini file created by the training
 > script will take care of the paths. However, you should supply paths
 > carefully while training your model. Before training, you should have
 > two separate corpus files that are lowercased, sentence-aligned and
 > tokenized accordingly (there are supplementary tools for this). Once you
 > have your corpus in two separate files, such as corpus.en and corpus.fr,
 > you will run a training Perl script, train-factored-phrase-model.perl,
 > with various parameters. If you need further help with this command after
 > installing Moses and all the training scripts, send me a reply including
 > the exact paths of your corpus files and I will try to figure out the
 > training command for your paths.
 >
 > Cheers
 >
 >
 > On 8/13/08, Llio Humphreys <[EMAIL PROTECTED]> wrote:
 > > Hi Murat,
 > > thanks for this.  I've got Ubuntu 8.04 so the Hardy Heron packages are
 > > what I need also
 > > (http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hardy/all/).
 > >
 > > I think I already got the order wrong...(sign of panic maybe?)
 > > I clicked on the mkcls deb and the package installer said it was already
 > > installed.
 > > I clicked on the srilm deb and the package installer said it was already
 > > installed, so I clicked Reinstall package.
 > >
 > > I can't find anything that says the order of installation, but note
 > > that the workshop baseline model requires installing giza before mkcls.
 > > Do I need to uninstall mkcls (if so how? is it just a matter of
 > > deleting the .exc file?) or is it enough to click on Reinstall
 > > package?
 > >
 > > When all this is done, how do I use Moses?  Many of the commands in
 > > the baseline model (http://www.statmt.org/wmt08/baseline.html) require
 > > pathnames to the various scripts and data:  is it necessary to amend
 > > these commands or can I just type 'whereis' command to find what I
 > > need?
 > >
 > > Thanks,
 > > Llio
 > >
 > >
 > > On Wed, Aug 13, 2008 at 1:48 PM, Murat ALPEREN <[EMAIL PROTECTED]> wrote:
 > > > Dear Llio,
 > > >
 > > > Eric's page will probably help you. I have installed the pre-compiled,
 > > > Debian-based Ubuntu Hardy Heron packages. All the necessary binaries are
 > > > included in Eric's repository, which will guide you through the
 > > > dependencies; that means there's an order of installation which you
 > > > should follow. As far as I remember, you should first install srilm,
 > > > then mkcls, giza and finally moses. Then you will be able to train your
 > > > models or run any model on your machine.
 > > >
 > > > Regards
 > > >
 > > >
 > > > On 8/13/08, Anung Ariwibowo <[EMAIL PROTECTED]> wrote:
 > > >>
 > > >> Hi Llio,
 > > >>
 > > >> I can compile SRILM in Linux Ubuntu without problem. Can you post the
 > > >> error message here, maybe we can help.
 > > >>
 > > >> Cheers,
 > > >> Anung
 > > >>
 > > >> On Wed, Aug 13, 2008 at 8:29 PM, Llio Humphreys <[EMAIL PROTECTED]>
 > > >> wrote:
 > > >>>
 > > >>> Dear Josh/Hieu,
 > > >>> many thanks for your replies.  The default shell is bash, and updating
 > > >>> the .profile file worked - thanks for that tip.  I look forward to
 > > >>> hearing more from you about the ./model/extract.0-0.o.part* problem.
 > > >>> My apologies for my ignorance of Unix matters: I'd like to think of
 > > >>> myself as a newbie rather than one who is averse to learning about
 > > >>> these things, and the further information you have provided has been
 > > >>> useful and interesting.  Hieu mentioned that Anung Ariwibowo got Moses
 > > >>> to work when he transferred to a Linux machine.  A colleague has
 > > >>> kindly let me borrow a Linux/Ubuntu machine, but I have already run
 > > >>> into problems compiling SRILM!  So, I'll see if Eric Nichols's
 > > >>> packages will take care of that:
 > > >>>
 > > >>> http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/feisty/nlp/
 > > >>> Best regards,
 > > >>> Llio
 > > >>>
 > > >>>
 > > >>>
 > > >>> On 8/13/08, Josh Schroeder <[EMAIL PROTECTED]> wrote:
 > > >>> > Hi Llio,
 > > >>> >
 > > >>> >
 > > >>> > > you may have already received my email on the following problem
 > > >>> > > when building the language model:
 > > >>> > >
 > > >>> > > Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
 > > >>> > > cat: ./model/extract.0-0.o.part*: No such file or directory
 > > >>> > > Exit code: 1
 > > >>> > >
 > > >>> >
 > > >>> >  That's building the phrase table, not the language model. It seems
 > > >>> > like several people on the list are having problems with this step,
 > > >>> > so I'm going to take a look at the training process and post
 > > >>> > something to the list in the next day or two.
 > > >>> >
 > > >>> >
 > > >>> > >
 > > >>> > > 1. You mention that Moses does not use environment variables.
 > > >>> > > However, in order to get SRILM to work, I found it necessary to
 > > >>> > > create environment variables and pass these on to SRILM's make:
 > > >>> > >
 > > >>> > > make SRILM=$PWD MACHINE_TYPE=macosx
 > > >>> > >
 > > >>> > > PATH=/bin:/sbin:/usr/bin:/usr/sbin:/Users/lliohumphreys/MT/MOSESSUITE/srilm:/Users/lliohumphreys/MT/MOSESSUITE/srilm/bin:/Users/lliohumphreys/MT/MOSESSUITE/srilm/bin/macosx:/sw/bin/gawk
 > > >>> > > MANPATH=/Users/lliohumphreys/MT/MOSESSUITE/srilm/man
 > > >>> > > LC_NUMERIC=C
 > > >>> > >
 > > >>> > > In addition, I was also required to type in the following command
 > > >>> > > for moses-scripts:
 > > >>> > >
 > > >>> > > export SCRIPTS_ROOTDIR=/Users/lliohumphreys/MT/MOSESSUITE/bin/moses-scripts/scripts-20080811-1801
 > > >>> > >
 > > >>> > >
 > > >>> >
 > > >>> >  Sorry, I should have been more clear. Moses itself, the decoder
 > > >>> > that loads a trained phrase table and language model and translates
 > > >>> > text, is a self-contained command-line program that doesn't require
 > > >>> > environment variables.
 > > >>> >
 > > >>> >  Your first example is compiling SRILM. This is not part of the Moses
 > > >>> > toolkit: it's a toolkit of its own for language modeling and a ton of
 > > >>> > other stuff. We use it as one of two possible integrated language
 > > >>> > models (the other is IRSTLM) with Moses.
 > > >>> >
 > > >>> >  Your second example is part of the training regime. Yes, there is
 > > >>> > some use of the SCRIPTS_ROOTDIR in the
 > > >>> > train-factored-phrase-model.perl, but for most training support
 > > >>> > scripts that come with moses there is a flag that lets you specify
 > > >>> > SCRIPTS_ROOTDIR at the command line instead of storing it as an
 > > >>> > environment variable. In train-factored-phrase-model it's
 > > >>> > "-scripts-root-dir", which I think you've actually used in one of
 > > >>> > your other emails.
 > > >>> >
 > > >>> >
 > > >>> >
 > > >>> > > If I open a new terminal and echo these variables, most of them
 > > >>> > > are blank, and PATH just gives the default bin paths.
 > > >>> > >
 > > >>> > > So, how do I make them permanent?  I assume that if I want to use
 > > >>> > > Moses again, it needs to have access to these variables?  How can
 > > >>> > > I ensure that I can close the terminal, go home, open a new
 > > >>> > > terminal the next day and get Moses working again?  A colleague
 > > >>> > > suggested I update the .bashrc file to update each new terminal
 > > >>> > > session with these environment variables. However, my Mac system
 > > >>> > > does not appear to have a .bashrc file by default, and when I
 > > >>> > > created one in my home directory and opened a new terminal, it did
 > > >>> > > not access the .bashrc file.
 > > >>> > >
 > > >>> >
 > > >>> >  Here's some info on environment variables on the Mac, found with a
 > > >>> > quick Google search:
 > > >>> > http://www.macdevcenter.com/pub/a/mac/2004/02/24/bash.html
 > > >>> >
 > > >>> >  I tried it with .profile, that worked fine. Are you sure you're set
 > > >>> > to use the bash shell? Try ' echo $SHELL ' in Terminal.
 > > >>> >
 > > >>> >
 > > >>> > > 2. You say that you ran the decoder on your laptop just fine, but
 > > >>> > > had to change a few scripts for training.  I have very basic
 > > >>> > > knowledge of Unix systems and installing open-source software:
 > > >>> > > would it be possible for you to detail the changes you made to the
 > > >>> > > scripts to get it to run on a Mac?  Although I need this
 > > >>> > > information urgently, it may also be useful for other students who
 > > >>> > > are installing Moses on a Mac and who may also have basic knowledge
 > > >>> > > of Unix installation procedures.
 > > >>> > >
 > > >>> >
 > > >>> >  I'll look into this. Mac isn't really the platform of choice for
 > > >>> > training a Moses model and I do most of my work on linux. If I
 > > >>> > recall correctly, an Intel-based Mac should be easier to get working
 > > >>> > than a PowerPC one. The *decoder* does work on my Intel-based
 > > >>> > laptop, but I haven't run a full training setup locally in some
 > > >>> > time -- most of the time we're working with so much data that I use
 > > >>> > a cluster of linux machines instead of my Mac.
 > > >>> >
 > > >>> >  As a word of caution: Moses isn't an out-of-the box translation
 > > >>> > solution for end users. It's research software undergoing active
 > > >>> > development, so almost every user -- on any platform --  will need
 > > >>> > to muck around in the scripts at some point, or face a compile error
 > > >>> > or runtime crash. The ability to deal with unix/linux command line
 > > >>> > tools, and debug code and scripts when necessary, is really
 > > >>> > important. That being said, I'll see what I can do about
 > > >>> > highlighting where the scripts might have problems on the Mac.
 > > >>> >
 > > >>> >
 > > >>> > > 3. My final question, which is embarrassingly basic: can I use the
 > > >>> > > one installation of Moses for different corpora, or do I need to
 > > >>> > > do a separate installation for each one?  Can I have separate
 > > >>> > > installations of SRILM, Giza and mkcls, or should they all
 > > >>> > > reference the same libraries?
 > > >>> > >
 > > >>> >
 > > >>> >  All you need to do to have moses use different corpora is point it
 > > >>> > to a different moses.ini file. Assuming you have compiled moses with
 > > >>> > support for the language model specified in the file (IRSTLM or
 > > >>> > SRILM), it will translate. You should only need one copy of giza,
 > > >>> > mkcls, irst/srilm, and moses. The code stays the same, it's the data
 > > >>> > model that's different.
 > > >>> >
 > > >>> >  -Josh
 > > >>> >
 > > >>> >
 > > >>> >
 > > >>> >  --
 > > >>> >  The University of Edinburgh is a charitable body, registered in
 > > >>> >  Scotland, with registration number SC005336.
 > > >>> >
 > > >>> >
 > > >>> _______________________________________________
 > > >>> Moses-support mailing list
 > > >>> [email protected]
 > > >>> http://mailman.mit.edu/mailman/listinfo/moses-support
 > > >>>
 > > >>
 > > >>
 > > >> --
 > > >> barliant at {gmail.com, yahoo.com}
 > > >> Starting July 2008, barliant at cbn.net.id is no longer active
 > > >> Visit my Blog at barliant dot blogspot dot com
 > > >>
 > > >>
 > > >
 > > >
 > >
 >
 >
