Re: [Moses-support] EMS set up with mgiza and KenLM

Hieu Hoang Tue, 26 Nov 2013 07:43:44 -0800

Delete those and rerun again. But also delete
  TRAINING_create-config*
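
Putting the deletions mentioned in this thread together, the cleanup could be sketched roughly as below. This is only a sketch: the function name clean_tuning_state is made up for illustration, and the working directory and run number are taken from the quoted config and log, so adjust them to your setup.

```shell
#!/usr/bin/env bash
# Sketch only: remove stale EMS outputs so the affected steps are rerun
# after "type = 8" was added to [LM]. The function name is hypothetical;
# the patterns are the ones mentioned in this thread.

clean_tuning_state() {
    local work_dir="$1"   # EMS working-dir, e.g. .../experiment
    local run="$2"        # run number that is passed to -continue

    # Tuning output and filtered models built from the old moses.ini.
    rm -rf "$work_dir"/tuning
    rm -rf "$work_dir"/evaluation/*.filtered."$run"

    # Step files for tuning and for the generated training config.
    rm -f "$work_dir"/steps/"$run"/TUNING*
    rm -f "$work_dir"/steps/"$run"/TRAINING_create-config*
}

# Example (paths as they appear in the thread):
#   clean_tuning_state /home/moses/project_test_mgiza/experiment 1
#   .../experiment.perl -continue 1 -exec
```

Nothing outside tuning/, evaluation/*.filtered.N and the two step-file patterns is touched, so the trained models and language models are left in place.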


On 26 November 2013 15:31, Daniel Valenzuela <dan...@valenzuela.de> wrote:

>   Yes, I had already added type=8 as one of my further workarounds.
>
>  To be sure I continued clean by
>  rm -r tuning/
>  rm steps/1/TUNING*
>
>  .../experiment.perl -continue 1 -exec
>
>  same output as before.
>
>  Then I continued even cleaner by
>  rm -r tuning/
> rm steps/1/TUNING*
>  rm -r evaluation/newstest2010.filtered.1/
> (there is nothing more *filtered.* in here)
>  .../experiment.perl -continue 1 -exec
>  and the output is the same except for evaluation/newstest2010.filtered.1/
> is missing.
>
>  But still I get a crash at the same TUNING:tune step.
>
>  My [LM] section looks like
>  [LM]
>
> lmplz = $moses-bin-dir/lmplz
> order = 3
> settings = "-T $working-dir/tmp -S 10G"
> lm-training = "$moses-script-dir/generic/trainlm-lmplz.perl -lmplz $lmplz"
> lm-binarizer = $moses-bin-dir/build_binary
> type = 8
>
>  Crash is still:
>  line=IRSTLM name=LM0 factor=0
> path=/home/moses/project_test_mgiza/experiment/lm/project-syndicate.binlm.1
> order=3
> Exception: Error: 4 number of threads specified but IRST LM is not
> threadsafe.
> Exit code: 1
> Failed to run moses with the config
> /home/moses/project_test_mgiza/experiment/tuning/moses.filtered.ini.1 at
> /home/moses/mosesdecoder/scripts/training/mert-moses.pl line 1271.
> cp: cannot stat
> ‘/home/moses/project_test_mgiza/experiment/tuning/tmp.1/moses.ini’: No such
> file or directory
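
The line=IRSTLM entry in this log is the LM feature line of the moses.ini that was loaded. To see which LM implementation each generated config names, a check along these lines may help. It is a sketch that assumes feature lines start with the implementation name, as IRSTLM does in the log; KENLM as the counterpart name and the function name lm_feature_lines are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: print the LM feature line of each given moses.ini so that a
# stale IRSTLM entry is easy to spot. The line format is assumed from the
# crash log; the function name is hypothetical.

lm_feature_lines() {
    local ini
    for ini in "$@"; do
        # -H prefixes each match with the file name.
        grep -H -E '^(IRSTLM|KENLM)[[:space:]]' "$ini" \
            || echo "$ini: no LM feature line found"
    done
}

# Example (files named in the thread, relative to the working dir):
#   lm_feature_lines tuning/moses.filtered.ini.1 tuning/filtered.1/moses.ini
```

If a config still shows IRSTLM after type = 8 was set, it was most likely generated before the change and has to be deleted and recreated.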
>
>  Thank you
>
> > Message: 1
> > Date: Tue, 26 Nov 2013 13:03:03 +0000
> > From: Hieu Hoang <hieuho...@gmail.com>
> > Subject: Re: [Moses-support] EMS set up with mgiza and KenLM
> > To: moses-support@mit.edu
> > Message-ID: <52949c07.3050...@gmail.com>
> > Content-Type: text/plain; charset="iso-8859-1"
> >
> > in the [LM] section, you have to put
> > type = 8
> > otherwise the moses.ini will be created to use IRSTLM
> >
> > You have to delete the filtering directory
> > tuning/filtered.?
> > evaluation/*.filtered.?
> > and delete the tuning sh file
> > steps/?/TUNING_tune.*
> >
> > then continue the experiment
> > .../experiment.perl -exec -continue=?
> >
> > On 26/11/2013 12:08, Daniel Valenzuela wrote:
> > > Dear all,
> > > after various manual set ups, I wanted to try the EMS. After trying
> > > several experiment settings I wanted to run it with multi-giza and
> > > kenlm, but I cannot get it to work (tried it again with smaller
> > > corpus, same result). I tried to continue the experiment with different
> > > fixes - no success.
> > > The log tells me:
> > > step TUNING:tune crashed
> > > further inspection in TUNING_tune.1.STDERR in steps/1/ told me IRSTLM is
> > > messing with my project, "against" my will (at least I thought so):
> > > line=IRSTLM name=LM0 factor=0
> > > path=/home/moses/project_test_mgiza/experiment/lm/project-syndicate.binlm.1
> > > order=3
> > > Exception: Error: 4 number of threads specified but IRST LM is not
> > > threadsafe.
> > > Exit code: 1
> > > Failed to run moses with the config
> > > /home/moses/project_test_mgiza/experiment/tuning/moses.filtered.ini.1
> > > at /home/moses/mosesdecoder/scripts/training/mert-moses.pl line 1271.
> > > cp: cannot stat
> > > '/home/moses/project_test_mgiza/experiment/tuning/tmp.1/moses.ini': No
> > > such file or directory
> > > Looking up what happened in the tuning folder, I found out that
> > > moses.filtered.ini.1 has set IRSTLM for Distortion, but
> > > filtered.1/moses.ini has set KenLM for Distortion which satisfies what
> > > I hoped to get.
> > > I attached the files from above and the following is the config file
> > > of the experiment:
> > > ################################################
> > > ### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###
> > > ################################################
> > >
> > >
> > > [GENERAL]
> > >
> > > home-dir = /home/moses
> > >
> > > working-dir = $home-dir/project_test_mgiza/experiment
> > > moses-src-dir = $home-dir/mosesdecoder
> > > moses-script-dir = $moses-src-dir/scripts
> > > moses-bin-dir = $moses-src-dir/bin
> > > external-bin-dir = $moses-src-dir/BINDIR
> > > data-dir = $home-dir/project_test_mgiza/experiment/corpus
> > > train-dir = $data-dir/training
> > > dev-dir = $data-dir/dev
> > > #irstlm-dir = $home-dir/irstlm/bin
> > >
> > >
> > > ttable-binarizer = $moses-bin-dir/processPhraseTable
> > > decoder = $moses-bin-dir/moses
> > >
> > > input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -l
> > > $input-extension -threads 4"
> > > output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -l
> > > $output-extension"
> > > input-truecaser = $moses-script-dir/recaser/truecase.perl
> > > output-truecaser = $moses-script-dir/recaser/truecase.perl
> > > detruecaser = $moses-script-dir/recaser/detruecase.perl
> > >
> > >
> > > input-extension = de
> > > output-extension = en
> > > pair-extension = de-en
> > >
> > > #################################################################
> > > # PARALLEL CORPUS PREPARATION:
> > > # create a tokenized, sentence-aligned corpus, ready for training
> > >
> > > [CORPUS]
> > >
> > > max-sentence-length = 80
> > >
> > > [CORPUS:project-syndicate]
> > > raw-stem = $train-dir/news-commentary-v8.$pair-extension
> > >
> > > [LM]
> > >
> > > ### tool to be used for language model training
> > > # for instance: ngram-count (SRILM), train-lm-on-disk.perl (Edinburgh)
> > > #
> > > #lm-training = "$moses-script-dir/generic/trainlm-irst2.perl -cores 4
> > > -irst-dir $irstlm-dir -temp-dir $working-dir/tmp"
> > > #settings = "-s msb -p 0"
> > > #order = 3
> > > #type = 8
> > > #lm-binarizer = $moses-bin-dir/build_binary
> > >
> > > # path to lmplz binary
> > > lmplz = $moses-bin-dir/lmplz
> > > # order of the language model
> > > order = 3
> > > # additional parameters to lmplz (check lmplz help message)
> > > settings = "-T $working-dir/tmp -S 10G"
> > > # this tells EMS to use lmplz and tells EMS where lmplz is located
> > > lm-training = "$moses-script-dir/generic/trainlm-lmplz.perl -lmplz
> > > $lmplz"
> > > lm-binarizer = $moses-bin-dir/build_binary
> > >
> > >
> > >
> > > [LM:project-syndicate]
> > > raw-corpus =
> > > $train-dir/news-commentary-v8.$pair-extension.$output-extension
> > >
> > >
> > > #################################################################
> > > # TRANSLATION MODEL TRAINING
> > >
> > > [TRAINING]
> > >
> > >
> > > ### training script to be used: either a legacy script or
> > > # current moses training script (default)
> > > #
> > > #script = $moses-script-dir/training/train-model.perl
> > >
> > >
> > > ### general options
> > > #
> > > script = $moses-script-dir/training/train-model.perl
> > > training-options = "-mgiza -mgiza-cpus 4 -cores 4 \
> > > -parallel -sort-buffer-size 10G -sort-batch-size 253 \
> > > -sort-compress gzip -sort-parallel 10"
> > > parallel = yes
> > >
> > > ### symmetrization method to obtain word alignments from giza output
> > > # (commonly used: grow-diag-final-and)
> > > #
> > > #alignment-symmetrization-method = berkeley
> > > alignment-symmetrization-method = grow-diag-final-and
> > >
> > > ### lexicalized reordering: specify orientation type
> > > # (default: only distance-based reordering model)
> > > #
> > > lexicalized-reordering = msd-bidirectional-fe
> > >
> > > ### if word alignment (giza symmetrization) should be skipped,
> > > # point to word alignment files
> > > #
> > > #word-alignment =
> > >
> > > ### if phrase extraction should be skipped,
> > > # point to stem for extract files
> > > #
> > > #extracted-phrases =
> > >
> > > ### if phrase table training should be skipped,
> > > # point to phrase translation table
> > > #
> > > #phrase-translation-table =
> > >
> > > ### if reordering table training should be skipped,
> > > # point to reordering table
> > > #
> > > #reordering-table =
> > >
> > > ### if training should be skipped,
> > > # point to a configuration file that contains
> > > # pointers to all relevant model files
> > > #
> > > #config =
> > >
> > > ### TUNING: finding good weights for model components
> > >
> > > [TUNING]
> > >
> > > ### instead of tuning with this setting, old weights may be recycled
> > >
> > > ### tuning script to be used
> > > #
> > > tuning-script = $moses-script-dir/training/mert-moses.pl
> > > tuning-settings = "-mertdir $moses-bin-dir -threads 4"
> > >
> > > ### specify the corpus used for tuning
> > > # it should contain 100s if not 1000s of sentences
> > > #
> > > raw-input = $dev-dir/news-test2008.$input-extension
> > >
> > > raw-reference = $dev-dir/news-test2008.$output-extension
> > >
> > > ### size of n-best list used (typically 100)
> > > #
> > > nbest = 100
> > >
> > > ### ranges for weights for random initialization
> > > # if not specified, the tuning script will use generic ranges
> > > # it is not clear, if this matters
> > > #
> > > # lambda =
> > >
> > > ### additional flags for the decoder
> > > #
> > > decoder-settings = "-threads 4"
> > >
> > > ### if tuning should be skipped, specify this here
> > > # and also point to a configuration file that contains
> > > # pointers to all relevant model files
> > > #
> > > #config =
> > >
> > >
> > > #######################################################
> > > ## TRUECASER: train model to truecase corpora and input
> > >
> > > [TRUECASER]
> > >
> > > ### script to train truecaser models
> > > #
> > > trainer = $moses-script-dir/recaser/train-truecaser.perl
> > >
> > > ### training data
> > > # raw input still needs to be tokenized;
> > > # already tokenized input may also be specified
> > > #
> > > raw-stem = CORPUS:raw-stem
> > >
> > > ### trained model
> > > #
> > > #truecase-model =
> > >
> > >
> > > ##################################
> > > ## EVALUATION: score system output
> > >
> > > [EVALUATION]
> > >
> > > ### prepare system output for scoring
> > > # this may include detokenization and wrapping output in sgm
> > > # (needed for nist-bleu, ter, meteor)
> > > #
> > > detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l
> > > $output-extension"
> > >
> > > decoder-settings = "-threads 4"
> > >
> > > ### should output be scored case-sensitive (default: no)?
> > > #
> > > # case-sensitive = yes
> > >
> > > ### BLEU
> > > #
> > >
> > > multi-bleu = "$moses-script-dir/generic/multi-bleu.perl -lc"
> > > # ibm-bleu =
> > >
> > > ### TER: translation error rate (BBN metric) based on edit distance
> > > #
> > > # ter = $edinburgh-script-dir/tercom_v6a.pl
> > >
> > > ### METEOR: gives credit to stem / WordNet synonym matches
> > > #
> > > # meteor =
> > >
> > > [EVALUATION:newstest2010]
> > > raw-input = $dev-dir/newstest2011.$input-extension
> > > raw-reference = $dev-dir/newstest2011.$output-extension
> > >
> > >
> > > [REPORTING]
> > >
> > > ### what to do with result (default: store in file evaluation/report)
> > > #
> > > # email = pko...@inf.ed.ac.uk
> > > ____________________
> > > I hope anybody can help or suggest me what to do.
> > > Thank you and kind regards
> > > Daniel
> > >
> > >
> > > _______________________________________________
> > > Moses-support mailing list
> > > Moses-support@mit.edu
> > > http://mailman.mit.edu/mailman/listinfo/moses-support
> > ***
>


-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

