building language models (using for example ngram-count) is computationally expensive. from what you tell the list, it seems that you don't have enough physical memory to run it properly.
you have a number of options:

--specify a lower-order model (eg 4 rather than 5, or even 3); depending upon how much monolingual training material you have, this may not produce worse results, and it will certainly run faster and require less space.

--divide your language model training material into chunks and run ngram-count on each chunk. this is one strategy for building LMs using all of the Gigaword corpus (when you don't have access to a 64-bit machine). here you would create multiple LMs.

--use a disk-based method of creating them. we have done this, and basically it trades speed for memory: it runs more slowly but needs less RAM.

--take the radical option and simply don't bother smoothing at all (ie use Google's "stupid backoff"). this makes training LMs trivial: just compute the counts of ngrams and work out how to store them. i reckon it should be possible to do this and create an ARPA file suitable for loading into SRILM.

--buy more machines.

Miles

2008/8/14 Llio Humphreys <[EMAIL PROTECTED]>

> Dear Murat, Anung, Hieu, Josh, Eric, Miles, Sara, Amittai,
> thank you all for your help. It is very, very much appreciated. I
> decided to try Eric's packages, and it looks like the installation
> worked. I typed some of the commands in the Baseline instructions
> without arguments, and the program either output to the screen that
> I missed some arguments or gave a description of the program.
> Thank you Eric!!!
>
> Following the Baseline instructions
> (http://www.statmt.org/wmt08/baseline.html) I have now got to the
> following step:
>
> Use SRILM to build language model:
> /path-to-srilm/bin/i686/ngram-count -order 5 -interpolate -kndiscount
> -text working-dir/lm/europarl.lowercased -lm
> working-dir/lm/europarl.lm
>
> In my case, I was in folder home/llio/MOSESMTDATA.
> I didn't know the path to ngram-count, but it was possible to
> invoke it without the path:
>
> ngram-count -order 5 -interpolate -kndiscount -text
> europarl/lm/europarl.lowercased -lm europarl/lm/europarl.lm
>
> I'm concerned about two things:
> 1) this ngram-count step is taking a very long time. I think I started
> it off around 6pm yesterday, but it's still going. It's very
> resource-intensive, and it's difficult to get to other open windows.
> I went to check up on it around 9pm, and couldn't find that particular
> terminal. I thought I had closed that terminal by mistake, so I stupidly
> opened another one, and entered the same command. I subsequently
> found that the original terminal was still open, so I closed the
> second one. I'm not sure if issuing this command a second time on the
> same program and files on a different terminal would corrupt the
> original ngram-count step, and whether I should start it off again, or
> whether starting it off again would make things worse? I looked up
> ngram-count
> (http://www.speech.sri.com/projects/srilm/manpages/ngram-count.1.html)
> and I don't think it outputs to any file, so I guess you have to be in
> the same terminal to do the next step? I opened another terminal and
> typed 'top' to see what processes are running, and I know that
> ngram-count is doing something, but whether it's doing well or stuck
> in a loop, I can't say. What I do find strange is that the time for
> ngram-count is said to be 00:58:20, and it's been going for hours.
> I searched this problem in previous Moses Group emails and I
> understand that if I run this with order 4 instead of 5 it will run
> quicker with very similar results? So, can I just stop what it's
> doing, and run this command in the same terminal with order 4? Are
> there any files I need to 'touch' to ensure that it doesn't leave any
> stone unturned?
>
> 2) how to do the next step:
>
> bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl
> -scripts-root-dir bin/moses-scripts/scripts-YYYYMMDD-HHMM -root-dir
> working-dir -corpus working-dir/corpus/europarl.lowercased -f fr -e en
> -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm
> 0:5:working-dir/lm/europarl.lm:0
>
> I assume that like ngram-count, I can just type in
> train-factored-phrase-model.perl without the full path... Do I need to
> set the -scripts-root-dir parameter? Are all the scripts in the same
> place?
>
> Thank you,
>
> Llio
>
> On 8/14/08, Murat ALPEREN <[EMAIL PROTECTED]> wrote:
> > Dear Llio,
> >
> > You should finally be okay with installing moses if you have already
> > installed all the dependent packages. I am not aware of the 'whereis'
> > command, but once you train your model, the moses.ini file which is
> > created by the training script will take care of the paths. However,
> > you should carefully supply paths while training your model. Before
> > training your model, you should have two separate corpus files which
> > are lowercased, sentence aligned and accordingly tokenized (there are
> > supplementary tools for this). Once you have your corpus in two
> > separate files such as corpus.en and corpus.fr, you will run a
> > training perl script, train-factored-phrase-model.perl, with various
> > parameters. If you need further help with this command after
> > installing moses and all training scripts, send me a reply including
> > the exact path for your corpus files and I will try to figure out the
> > training command for your paths.
> >
> > Cheers
> >
> > On 8/13/08, Llio Humphreys <[EMAIL PROTECTED]> wrote:
> > > Hi Murat,
> > > thanks for this. I've got Ubuntu 8.04 so the Hardy Heron packages are
> > > what I need also
> > > (http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/hardy/all/).
> > >
> > > I think I already got the order wrong... (sign of panic maybe?)
> > > I clicked on mkcls deb and the package installer said it was already
> > > installed.
> > > I clicked on srilm deb and the package installer said it was already
> > > installed, so I clicked Reinstall package.
> > >
> > > I can't find anything that says the order of installation, but note
> > > that the workshop baseline model requires installing giza before
> > > mkcls. Do I need to uninstall mkcls (if so how? is it just a matter
> > > of deleting the .exc file?) or is it enough to click on Reinstall
> > > package?
> > >
> > > When all this is done, how do I use Moses? Many of the commands in
> > > the baseline model (http://www.statmt.org/wmt08/baseline.html)
> > > require pathnames to the various scripts and data: is it necessary
> > > to amend these commands or can I just type 'whereis' command to find
> > > what I need?
> > >
> > > Thanks,
> > > Llio
> > >
> > > On Wed, Aug 13, 2008 at 1:48 PM, Murat ALPEREN <[EMAIL PROTECTED]> wrote:
> > > > Dear Llio,
> > > >
> > > > Eric's page will probably help you; I have installed the
> > > > pre-compiled Debian-based Ubuntu (Hardy Heron) packages. All the
> > > > necessary binaries are included in Eric's repository, which will
> > > > guide you through the dependencies; that means there's an order of
> > > > installation which you should follow. As far as I remember you
> > > > should first install srilm, then mkcls, giza and finally moses.
> > > > Then you will be able to train your models or run any model on
> > > > your machine.
> > > >
> > > > Regards
> > > >
> > > > On 8/13/08, Anung Ariwibowo <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >> Hi Llio,
> > > >>
> > > >> I can compile SRILM in Linux Ubuntu without problem. Can you post
> > > >> the error message here, maybe we can help.
> > > >>
> > > >> Cheers,
> > > >> Anung
> > > >>
> > > >> On Wed, Aug 13, 2008 at 8:29 PM, Llio Humphreys <[EMAIL PROTECTED]> wrote:
> > > >>>
> > > >>> Dear Josh/Hieu,
> > > >>> many thanks for your replies. The default shell is bash, and
> > > >>> updating the .profile file worked - thanks for that tip. I look
> > > >>> forward to hearing more from you about the
> > > >>> ./model/extract.0-0.o.part* problem.
> > > >>> My apologies for my ignorance of Unix matters: I'd like to think
> > > >>> of myself as a newbie rather than one who is averse to learning
> > > >>> about these things, and the further information you have provided
> > > >>> has been useful and interesting. Hieu mentioned that Anung
> > > >>> Ariwibowo got Moses to work when he transferred to a Linux
> > > >>> machine. A colleague has kindly let me borrow a Linux/Ubuntu
> > > >>> machine, but I have already run into problems compiling SRILM!
> > > >>> So, I'll see if Eric Nichols's packages will take care of that:
> > > >>> http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/feisty/nlp/
> > > >>> Best regards,
> > > >>> Llio
> > > >>>
> > > >>> On 8/13/08, Josh Schroeder <[EMAIL PROTECTED]> wrote:
> > > >>> > Hi Llio,
> > > >>> >
> > > >>> > > you may have already received my email on the following
> > > >>> > > problem when building the language model:
> > > >>> > >
> > > >>> > > Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
> > > >>> > > cat: ./model/extract.0-0.o.part*: No such file or directory
> > > >>> > > Exit code: 1
> > > >>> >
> > > >>> > That's building the phrase table, not the language model.
> > > >>> > It seems like several people on the list are having problems
> > > >>> > with this step, so I'm going to take a look at the training
> > > >>> > process and post something to the list in the next day or two.
> > > >>> >
> > > >>> > > 1. You mention that Moses does not use environment variables.
> > > >>> > > However, in order to get SRILM to work, I found it necessary
> > > >>> > > to create environment variables and pass these on to SRILM's
> > > >>> > > make:
> > > >>> > >
> > > >>> > > make SRILM=$PWD MACHINE_TYPE=macosx
> > > >>> > >
> > > >>> > > PATH=/bin:/sbin:/usr/bin:/usr/sbin:/Users/lliohumphreys/MT/MOSESSUITE/srilm:/Users/lliohumphreys/MT/MOSESSUITE/srilm/bin:/Users/lliohumphreys/MT/MOSESSUITE/srilm/bin/macosx:/sw/bin/gawk
> > > >>> > > MANPATH=/Users/lliohumphreys/MT/MOSESSUITE/srilm/man
> > > >>> > > LC_NUMERIC=C
> > > >>> > >
> > > >>> > > In addition, I was also required to type in the following
> > > >>> > > command for moses-scripts:
> > > >>> > >
> > > >>> > > export SCRIPTS_ROOTDIR=/Users/lliohumphreys/MT/MOSESSUITE/bin/moses-scripts/scripts-20080811-1801
> > > >>> >
> > > >>> > Sorry, I should have been more clear. Moses itself, the decoder
> > > >>> > that loads a trained phrase table and language model and
> > > >>> > translates text, is a self-contained command-line program that
> > > >>> > doesn't require environment variables.
> > > >>> >
> > > >>> > Your first example is compiling SRILM. This is not part of the
> > > >>> > Moses toolkit: it's a toolkit of its own for language modeling
> > > >>> > and a ton of other stuff. We use it as one of two possible
> > > >>> > integrated language models (the other is IRSTLM) with Moses.
> > > >>> >
> > > >>> > Your second example is part of the training regime.
> > > >>> > Yes, there is some use of SCRIPTS_ROOTDIR in
> > > >>> > train-factored-phrase-model.perl, but for most training support
> > > >>> > scripts that come with moses there is a flag that lets you
> > > >>> > specify SCRIPTS_ROOTDIR at the command line instead of storing
> > > >>> > it as an environment variable. In train-factored-phrase-model
> > > >>> > it's "-scripts-root-dir", which I think you've actually used in
> > > >>> > one of your other emails.
> > > >>> >
> > > >>> > > If I open a new terminal and echo these variables, most of
> > > >>> > > them are blank, and PATH just gives the default bin paths.
> > > >>> > >
> > > >>> > > So, how do I make them permanent? I assume that if I want to
> > > >>> > > use Moses again, it needs to have access to these variables?
> > > >>> > > How can I ensure that I can close the terminal, go home, open
> > > >>> > > a new terminal the next day and get Moses working again? A
> > > >>> > > colleague suggested I update the .bashrc file to update each
> > > >>> > > new terminal session with these environment variables.
> > > >>> > > However, my Mac system does not appear to have a .bashrc file
> > > >>> > > by default, and when I created one in my home directory and
> > > >>> > > opened a new terminal, it did not access the .bashrc file.
> > > >>> >
> > > >>> > Here's some info on environment variables on the Mac, found
> > > >>> > with a quick Google search:
> > > >>> > http://www.macdevcenter.com/pub/a/mac/2004/02/24/bash.html
> > > >>> >
> > > >>> > I tried it with .profile, that worked fine. Are you sure you're
> > > >>> > set to use the bash shell? Try 'echo $SHELL' in Terminal.
> > > >>> >
> > > >>> > > 2. You say that you ran the decoder on your laptop just fine,
> > > >>> > > but had to change a few scripts for training. I have very
> > > >>> > > basic knowledge of Unix systems and installing open-source
> > > >>> > > software: would it be possible for you to detail the changes
> > > >>> > > you made to the scripts to get it to run on a Mac? Although I
> > > >>> > > need this information urgently, it may also be useful for
> > > >>> > > other students who are installing Moses on a Mac and who may
> > > >>> > > also have basic knowledge of Unix installation procedures.
> > > >>> >
> > > >>> > I'll look into this. Mac isn't really the platform of choice
> > > >>> > for training a Moses model and I do most of my work on linux.
> > > >>> > If I recall correctly, an Intel-based Mac should be easier to
> > > >>> > get working than a PowerPC one. The *decoder* does work on my
> > > >>> > Intel-based laptop, but I haven't run a full training setup
> > > >>> > locally in some time -- most of the time we're working with so
> > > >>> > much data that I use a cluster of linux machines instead of my
> > > >>> > Mac.
> > > >>> >
> > > >>> > As a word of caution: Moses isn't an out-of-the-box translation
> > > >>> > solution for end users. It's research software undergoing
> > > >>> > active development, so almost every user -- on any platform --
> > > >>> > will need to muck around in the scripts at some point, or face
> > > >>> > a compile error or runtime crash. The ability to deal with
> > > >>> > unix/linux command line tools, and debug code and scripts when
> > > >>> > necessary, is really important. That being said, I'll see what
> > > >>> > I can do about highlighting where the scripts might have
> > > >>> > problems on the Mac.
> > > >>> >
> > > >>> > > 3. My final question, which is embarrassingly basic... can I
> > > >>> > > use the one installation of Moses for different corpora, or
> > > >>> > > do I need to do a separate installation for each one? Can I
> > > >>> > > have separate installations of SRILM, Giza and mkcls, or
> > > >>> > > should they all reference the same libraries?
> > > >>> >
> > > >>> > All you need to do to have moses use different corpora is point
> > > >>> > it to a different moses.ini file. Assuming you have compiled
> > > >>> > moses with support for the language model specified in the file
> > > >>> > (IRSTLM or SRILM), it will translate. You should only need one
> > > >>> > copy of giza, mkcls, irst/srilm, and moses. The code stays the
> > > >>> > same, it's the data model that's different.
> > > >>> >
> > > >>> > -Josh
> > > >>> >
> > > >>> > --
> > > >>> > The University of Edinburgh is a charitable body, registered in
> > > >>> > Scotland, with registration number SC005336.
> > > >>>
> > > >>> _______________________________________________
> > > >>> Moses-support mailing list
> > > >>> [email protected]
> > > >>> http://mailman.mit.edu/mailman/listinfo/moses-support
> > > >>
> > > >> --
> > > >> barliant at {gmail.com, yahoo.com}
> > > >> Starting July 2008, barliant at cbn.net.id is no longer active
> > > >> Visit my Blog at barliant dot blogspot dot com
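P.S. for anyone who wants to see what the count-only option at the top of this thread amounts to, here is a minimal Python sketch. it is not SRILM code and nothing like it was posted to the list: the function names (count_ngrams, merge_counts, total_tokens, stupid_backoff) are made up for illustration, and the backoff factor of 0.4 is the value suggested in the stupid-backoff paper. it counts ngrams chunk by chunk, merges the counts, and scores a word by relative frequency, backing off to shorter contexts with a fixed penalty. the scores are not normalised probabilities, so writing them out as an ARPA file would need extra work that is not shown here.

```python
from collections import Counter

ALPHA = 0.4  # fixed backoff penalty (assumption: value from the stupid-backoff paper)


def count_ngrams(sentences, order):
    """Count every n-gram of length 1..order in an iterable of token lists."""
    counts = Counter()
    for tokens in sentences:
        # Pad each sentence separately so chunks can be counted independently.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
    return counts


def merge_counts(chunk_counts):
    """Add up the per-chunk counters into a single counter."""
    merged = Counter()
    for counts in chunk_counts:
        merged.update(counts)
    return merged


def total_tokens(counts):
    """Number of unigram tokens (including the <s> and </s> padding)."""
    return sum(c for ngram, c in counts.items() if len(ngram) == 1)


def stupid_backoff(counts, n_tokens, context, word, alpha=ALPHA):
    """Unsmoothed relative-frequency score with a fixed backoff penalty."""
    context = tuple(context)
    scale = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return scale * counts[ngram] / counts[context]
        # Unseen: drop the oldest context word and apply the penalty.
        context = context[1:]
        scale *= alpha
    # No context left: fall back to the unigram relative frequency.
    return scale * counts[(word,)] / n_tokens


if __name__ == "__main__":
    # Two tiny "chunks" standing in for pieces of a large training corpus.
    chunk_a = [["the", "cat", "sat"]]
    chunk_b = [["the", "cat", "ran"]]
    counts = merge_counts([count_ngrams(chunk_a, 3), count_ngrams(chunk_b, 3)])
    n = total_tokens(counts)
    print(stupid_backoff(counts, n, ("the", "cat"), "sat"))  # 0.5 (seen trigram: 1 / 2)
    print(stupid_backoff(counts, n, ("the", "sat"), "cat"))  # backs off twice to the unigram
```

because each sentence is padded on its own, counting chunks separately and merging gives the same table as counting the whole corpus in one go, so only one chunk's worth of text has to be processed at a time (the merged table itself still has to fit somewhere, which is the "work out how to store them" part).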
