Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model

Miles Osborne Thu, 14 Aug 2008 05:24:15 -0700

building language models (using for example ngram-count) is computationally
expensive.  from what you tell the list, it seems that you don't have enough
physical memory to run it properly.


you have a number of options:

--specify a lower order model (eg 4 rather than 5, or even 3);  depending
upon how much monolingual training material you have, this may not produce
worse results  and it will certainly run faster and will require less space.

--divide your language model training material into chunks and run
ngram-count on each chunk.  this is one strategy for building LMs using all
of the Gigaword corpus (when you don't have access to a 64 bit machine).
here you would create multiple LMs.

--use a disk-based method of creating them.  we have done this, and
basically it trades speed for memory (it runs slower but needs less RAM).

--take the radical option and simply don't bother smoothing at all (ie use
Google's "stupid backoff").  this makes training LMs trivial --just compute
the counts of ngrams and work out how to store them.  i reckon it should be
possible to do this and create an ARPA file suitable for loading into
SRILM.

--buy more machines.
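
A minimal sketch of the "stupid backoff" option above, written in Python and
not part of SRILM or Moses. The function names, the toy sentences and the
0.4 backoff factor (the value suggested in the stupid backoff paper by Brants
et al.) are assumptions for illustration. It only computes counts and scores;
it does not write an ARPA file.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor, as suggested by Brants et al. (2007)


def count_ngrams(sentences, max_order):
    """Count every n-gram of order 1..max_order; sentences are token lists."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for n in range(1, max_order + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Return the unnormalised score S(word | context).

    Uses the relative frequency of the longest n-gram that was seen,
    multiplying by alpha each time the context is shortened by one word.
    """
    context = tuple(context)
    factor = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return factor * counts[ngram] / counts[context]
        context = context[1:]  # drop the oldest context word and back off
        factor *= alpha
    # No context left: fall back to the unigram relative frequency.
    return factor * counts[(word,)] / total_tokens


# Toy usage with two hypothetical training sentences.
sentences = [["the", "cat", "sat"], ["the", "cat", "ran"]]
counts = count_ngrams(sentences, max_order=3)
total = sum(c for gram, c in counts.items() if len(gram) == 1)
print(stupid_backoff("sat", ["the", "cat"], counts, total))
```

Because the scores are plain ratios of counts, the counts can be collected
chunk by chunk and simply added together, which is why no single machine has
to hold the whole corpus at once.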

Miles

2008/8/14 Llio Humphreys <[EMAIL PROTECTED]>

> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
> thank you all for your help.  It is very, very much appreciated. I
> decided to try Eric's packages, and it looks like the installation
> worked.  I typed some of the
>  commands in the Baseline instructions without arguments, and the
>  program either output to the screen that I missed some arguments or
>  gave a description of the program.  Thank you Eric!!!
>
>  Following the Baseline instructions
>  (http://www.statmt.org/wmt08/baseline.html) I have now got to the
>  following step:
>
>  Use SRILM to build language model:
>  /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
>  -text working-dir/lm/europarl.lowercased -lm
>  working-dir/lm/europarl.lm
>
>  In my case, I was in folder home/llio/MOSESMTDATA.  I didn't know the
>  path to ngram-count, but it was possible to invoke it without the
>  path:
>
>  ngram-count -order 5 -interpolate -kndiscount -text
>  europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
>
>  I'm concerned about two things:
>  1) this ngram-count step is taking a very long time.  I think I started
>  it off around 6pm yesterday, but it's still going.  It's very
>  resource-intensive, and it's difficult to get other windows open.
>  I went to check up on it around 9pm, and couldn't find that particular
>  terminal.  I thought I had closed that terminal by mistake, so I stupidly
>  opened another one, and entered the same command.  I subsequently
>  found that the original terminal was still open, so I closed the
>  second one.  I'm not sure if issuing this command a second time on the
>  same program and files on a different terminal would corrupt the
>  original ngram-count step, and whether I should start it off again, or
>  whether starting it off again would make things worse?   I looked up
>  ngram-count (
> http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
>  and I don't think it outputs to any file, so I guess you have to be in
>  the same terminal to do the next step?  I opened
>  another terminal and typed 'top' to see what processes are running,
>  and I know that ngram-count is doing something, but whether it's doing
>  well or stuck in a loop, I can't say.  What I do find strange is that
> the time for ngram-count is said to be 00:58:20, and it's been going
> for hours.. I searched this problem in previous Moses Group emails and
> I understand that if I run this with order 4 instead of 5 it will run
> quicker with very similar results?  So, can I just stop what it's
> doing, and run this command in the same terminal with order 4?  Are
> there any files I need to 'touch' to ensure that it doesn't leave any
> stone unturned?
>
>  2) how to do the next step:
>
>
>  
> bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl
>  -scripts-root-dir bin/moses-scripts/scripts-YYYYMMDD-HHMM -root-dir
>  working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en
>  -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm
>  0:5:working-dir/lm/europarl.lm:0
>
> I assume that like ngram-count, I can just type in
> train-factored-phrase-model.perl without the full path...Do I need to
> set the -scripts-root-dir parameter?  Are all the scripts in the same
> place?
>
> Thank you,
>
> Llio
>
>
>
>
>  On 8/14/08, Murat ALPEREN <[EMAIL PROTECTED]> wrote:
>  > Dear Llio,
>  >
>  > You should finally be able to install moses if you have already
>  > installed all the dependent packages. I am not aware of the 'whereis'
>  > command, but once you train your model, the moses.ini file created by
>  > the training script will take care of the paths. However, you should
>  > supply paths carefully while training your model. Before training, you
>  > should have two separate corpus files which are lowercased, sentence
>  > aligned and tokenized accordingly (there are supplementary tools for
>  > this). Once you have your corpus in two separate files such as
>  > corpus.en and corpus.fr, you will run a training perl script,
>  > train-factored-phrase-model.perl, with various parameters. If you need
>  > further help with this command after installing moses and all training
>  > scripts, send me a reply including the exact paths of your corpus files
>  > and I will try to figure out the training command for your paths.
>  >
>  > Cheers
>  >
>  >
>  > On 8/13/08, Llio Humphreys <[EMAIL PROTECTED]> wrote:
>  > > Hi Murat,
>  > > thanks for this.  I've got Ubuntu 8.04 so the Hardy Heron packages are
>  > > what I need also
>  > > (http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hardy/all/).
>  > >
>  > > I think I already got the order wrong...(sign of panic maybe?)
>  > > I clicked on the mkcls deb and the package installer said it was already
>  > > installed.
>  > > I clicked on the srilm deb and the package installer said it was already
>  > > installed, so I clicked Reinstall package.
>  > >
>  > > I can't find anything that says the order of installation, but note
>  > > that the workshop baseline model requires installing giza before mkcls.
>  > > Do I need to uninstall mkcls (if so how? is it just a matter of
>  > > deleting the .exc file?) or is it enough to click on Reinstall
>  > > package?
>  > >
>  > > When all this is done, how do I use Moses?  Many of the commands in
>  > > the baseline model (http://www.statmt.org/wmt08/baseline.html) require
>  > > pathnames to the various scripts and data:  is it necessary to amend
>  > > these commands or can I just type 'whereis' command to find what I
>  > > need?
>  > >
>  > > Thanks,
>  > > Llio
>  > >
>  > >
>  > > On Wed, Aug 13, 2008 at 1:48 PM, Murat ALPEREN <[EMAIL PROTECTED]>
>  > wrote:
>  > > > Dear Llio,
>  > > >
>  > > > Eric's page will probably help you. I have installed the pre-compiled
>  > > > Debian-based Ubuntu Hardy Heron packages. All the necessary binaries
>  > > > are included in Eric's repository, which will guide you through the
>  > > > dependencies; that means there is an order of installation which you
>  > > > should follow. As far as I remember, you should first install srilm,
>  > > > then mkcls, giza and finally moses. Then you will be able to train
>  > > > your models or run any model on your machine.
>  > > >
>  > > > Regards
>  > > >
>  > > >
>  > > > On 8/13/08, Anung Ariwibowo <[EMAIL PROTECTED]> wrote:
>  > > >>
>  > > >> Hi Llio,
>  > > >>
>  > > >> I can compile SRILM in Linux Ubuntu without problem. Can you post
>  > > >> the error message here, maybe we can help.
>  > > >>
>  > > >> Cheers,
>  > > >> Anung
>  > > >>
>  > > >> On Wed, Aug 13, 2008 at 8:29 PM, Llio Humphreys <[EMAIL PROTECTED]>
>  > > >> wrote:
>  > > >>>
>  > > >>> Dear Josh/Hieu,
>  > > >>> many thanks for your replies.  The default shell is bash, and
>  > > >>> updating the .profile file worked - thanks for that tip.  I look
>  > > >>> forward to hearing more from you about the
>  > > >>> ./model/extract.0-0.o.part* problem.  My apologies for my ignorance
>  > > >>> of Unix matters: I'd like to think of myself as a newbie rather than
>  > > >>> one who is averse to learning about these things, and the further
>  > > >>> information you have provided has been useful and interesting.  Hieu
>  > > >>> mentioned that Anung Ariwibowo got Moses to work when he transferred
>  > > >>> to a Linux machine.  A colleague has kindly let me borrow a
>  > > >>> Linux/Ubuntu machine, but I have already run into problems compiling
>  > > >>> SRILM!  So, I'll see if Eric Nichols's packages will take care of
>  > > >>> that:
>  > > >>>
>  > > >>> http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/feisty/nlp/
>  > > >>> Best regards,
>  > > >>> Llio
>  > > >>>
>  > > >>>
>  > > >>>
>  > > >>> On 8/13/08, Josh Schroeder <[EMAIL PROTECTED]> wrote:
>  > > >>> > Hi Llio,
>  > > >>> >
>  > > >>> >
>  > > >>> > > you may have already received my email on the following
>  > > >>> > > problem when building the language model:
>  > > >>> > >
>  > > >>> > > Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
>  > > >>> > > cat: ./model/extract.0-0.o.part*: No such file or directory
>  > > >>> > > Exit code: 1
>  > > >>> > >
>  > > >>> >
>  > > >>> >  That's building the phrase table, not the language model. It
>  > > >>> > seems like several people on the list are having problems with
>  > > >>> > this step, so I'm going to take a look at the training process and
>  > > >>> > post something to the list in the next day or two.
>  > > >>> >
>  > > >>> >
>  > > >>> > >
>  > > >>> > > 1. You mention that Moses does not use environment variables.
>  > > >>> > > However, in order to get SRILM to work, I found it necessary
>  > > >>> > > to create environment variables and pass these on to SRILM's
>  > > >>> > > make:
>  > > >>> > >
>  > > >>> > > make SRILM=$PWD MACHINE_TYPE=macosx
>  > > >>> > >
>  > > >>> > > PATH=/bin:/sbin:/usr/bin:/usr/sbin:/Users/lliohumphreys/MT/MOSESSUITE/srilm:/Users/lliohumphreys/MT/MOSESSUITE/srilm/bin:/Users/lliohumphreys/MT/MOSESSUITE/srilm/bin/macosx:/sw/bin/gawk
>  > > >>> > > MANPATH=/Users/lliohumphreys/MT/MOSESSUITE/srilm/man
>  > > >>> > > LC_NUMERIC=C
>  > > >>> > >
>  > > >>> > > In addition, I was also required to type in the following
>  > > >>> > > command for moses-scripts:
>  > > >>> > >
>  > > >>> > > export SCRIPTS_ROOTDIR=/Users/lliohumphreys/MT/MOSESSUITE/bin/moses-scripts/scripts-20080811-1801
>  > > >>> > >
>  > > >>> >
>  > > >>> >  Sorry, I should have been more clear. Moses itself, the decoder
>  > > >>> > that loads a trained phrase table and language model and
>  > > >>> > translates text, is a self-contained command-line program that
>  > > >>> > doesn't require environment variables.
>  > > >>> >
>  > > >>> >  Your first example is compiling SRILM. This is not part of the
>  > > >>> > Moses toolkit: it's a toolkit of its own for language modeling and
>  > > >>> > a ton of other stuff. We use it as one of two possible integrated
>  > > >>> > language models (the other is IRSTLM) with Moses.
>  > > >>> >
>  > > >>> >  Your second example is part of the training regime. Yes, there
>  > > >>> > is some use of the SCRIPTS_ROOTDIR in the
>  > > >>> > train-factored-phrase-model.perl, but for most training support
>  > > >>> > scripts that come with moses there is a flag that lets you specify
>  > > >>> > SCRIPTS_ROOTDIR at the command line instead of storing it as an
>  > > >>> > environment variable. In train-factored-phrase-model it's
>  > > >>> > "-scripts-root-dir", which I think you've actually used in one of
>  > > >>> > your other emails.
>  > > >>> >
>  > > >>> >
>  > > >>> >
>  > > >>> > > If I open a new terminal and echo these variables, most of
>  > > >>> > > them are blank, and PATH just gives the default bin paths.
>  > > >>> > >
>  > > >>> > > So, how do I make them permanent?  I assume that if I want to
>  > > >>> > > use Moses again, it needs to have access to these variables?
>  > > >>> > > How can I ensure that I can close the terminal, go home, open a
>  > > >>> > > new terminal the next day and get Moses working again?  A
>  > > >>> > > colleague suggested I update the .bashrc file to update each new
>  > > >>> > > terminal session with these environment variables. However, my
>  > > >>> > > Mac system does not appear to have a .bashrc file by default,
>  > > >>> > > and when I created one in my home directory and opened a new
>  > > >>> > > terminal, it did not access the .bashrc file.
>  > > >>> > >
>  > > >>> >
>  > > >>> >  Here's some info on environment variables on the Mac, found
>  > > >>> > with a quick Google search:
>  > > >>> > http://www.macdevcenter.com/pub/a/mac/2004/02/24/bash.html
>  > > >>> >
>  > > >>> >  I tried it with .profile, that worked fine. Are you sure you're
>  > > >>> > set to use the bash shell? Try ' echo $SHELL ' in Terminal.
>  > > >>> >
>  > > >>> >
>  > > >>> > > 2. You say that you ran the decoder on your laptop just fine,
>  > > >>> > > but had to change a few scripts for training.  I have very basic
>  > > >>> > > knowledge of Unix systems and installing open-source software:
>  > > >>> > > would it be possible for you to detail the changes you made to
>  > > >>> > > the scripts to get it to run on a Mac?  Although I need this
>  > > >>> > > information urgently, it may also be useful for other students
>  > > >>> > > who are installing Moses on a Mac and who may also have basic
>  > > >>> > > knowledge of Unix installation procedures.
>  > > >>> > >
>  > > >>> >
>  > > >>> >  I'll look into this. Mac isn't really the platform of choice
>  > > >>> > for training a Moses model and I do most of my work on linux. If I
>  > > >>> > recall correctly, an Intel-based Mac should be easier to get
>  > > >>> > working than a PowerPC one. The *decoder* does work on my
>  > > >>> > Intel-based laptop, but I haven't run a full training setup
>  > > >>> > locally in some time -- most of the time we're working with so
>  > > >>> > much data that I use a cluster of linux machines instead of my
>  > > >>> > Mac.
>  > > >>> >
>  > > >>> >  As a word of caution: Moses isn't an out-of-the box translation
>  > > >>> > solution for end users. It's research software undergoing active
>  > > >>> > development, so almost every user -- on any platform --  will need
>  > > >>> > to muck around in the scripts at some point, or face a compile
>  > > >>> > error or runtime crash. The ability to deal with unix/linux
>  > > >>> > command line tools, and debug code and scripts when necessary, is
>  > > >>> > really important. That being said, I'll see what I can do about
>  > > >>> > highlighting where the scripts might have problems on the Mac.
>  > > >>> >
>  > > >>> >
>  > > >>> > > 3. My final question, which is embarrassingly basic...can I use
>  > > >>> > > the one installation of Moses for different corpora, or do I
>  > > >>> > > need to do a separate installation for each one?  Can I have
>  > > >>> > > separate installations of SRILM, Giza and mkcls, or should they
>  > > >>> > > all reference the same libraries?
>  > > >>> > >
>  > > >>> >
>  > > >>> >  All you need to do to have moses use different corpora is point
>  > > >>> > it to a different moses.ini file. Assuming you have compiled moses
>  > > >>> > with support for the language model specified in the file (IRSTLM
>  > > >>> > or SRILM), it will translate. You should only need one copy of
>  > > >>> > giza, mkcls, irst/srilm, and moses. The code stays the same, it's
>  > > >>> > the data model that's different.
>  > > >>> >
>  > > >>> >  -Josh
>  > > >>> >
>  > > >>> >
>  > > >>> >
>  > > >>> >  --
>  > > >>> >  The University of Edinburgh is a charitable body, registered in
>  > > >>> >  Scotland, with registration number SC005336.
>  > > >>> >
>  > > >>> >
>  > > >>> _______________________________________________
>  > > >>> Moses-support mailing list
>  > > >>> [email protected]
>  > > >>> http://mailman.mit.edu/mailman/listinfo/moses-support
>  > > >>>
>  > > >>
>  > > >>
>  > > >> --
>  > > >> barliant at {gmail.com, yahoo.com}
>  > > >> Starting July 2008, barliant at cbn.net.id is no longer active
>  > > >> Visit my Blog at barliant dot blogspot dot com
>  > > >>
>  > > >>
>  > > >
>  > > >
>  > >
>  >
>  >
>
>


