Re: [Moses-support] EMS fails on tuning

Tomas Hudik Tue, 29 May 2012 23:42:36 -0700

Hi Dimitris,
Please post the error log you get when you translate a sentence directly, e.g.:
echo "translate some sentence" | ./moses -f your_moses.ini
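
If it is easier, here is a small sketch that does the same thing but keeps the translation and the decoder log in separate files. The MOSES and INI variables and the file names out.txt and moses.err are placeholders of mine, not anything Moses requires:

```shell
#!/bin/sh
# Hypothetical paths: point MOSES and INI at your own decoder binary and config.
MOSES="${MOSES:-./moses}"
INI="${INI:-your_moses.ini}"

status=0
# stdout (the translation) goes to out.txt, stderr (the decoder log) to moses.err
echo "translate some sentence" | "$MOSES" -f "$INI" > out.txt 2> moses.err || status=$?

echo "decoder exit code: $status"
# the last lines of the log are where a failed check would normally show up
tail -n 20 moses.err
```

An exit code of 134 here (128 + 6, i.e. SIGABRT) would match the "Exit code: 134" in your TUNING_tune.stderr.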


cheers, Tomas

-----Original Message-----
From: Δημήτρης Μπαμπανιώτης [mailto:[email protected]] 
Sent: Tuesday, May 29, 2012 11:41 PM
To: Philipp Koehn
Cc: [email protected]
Subject: Re: [Moses-support] EMS fails on tuning

On 28/05/2012 10:01 PM, Philipp Koehn wrote:
> Hi,
>
> there is a problem here:
>
> # conversion of phrase table into binary on-disk format 
> #ttable-binarizer = $moses-bin-dir/processPhraseTable
>
> # conversion of rule table into binary on-disk format
> ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
>
> You are using the ttable binarizer for the hierarchical/syntax model, 
> but you use a phrase-based model.
>
> -phi
>
> On Sun, May 27, 2012 at 11:45 PM, Dimitris Babaniotis 
> <[email protected]>  wrote:
>> Hello, I'm trying to run experiments with EMS but the process stops 
>> on tuning:tune.
>>
>> Here is the TUNING_tune.stderr file :
>>
>> main::create_extractor_script() called too early to check prototype at /home/dimbaba/moses/scripts/training/mert-moses.pl line 674.
>> Using SCRIPTS_ROOTDIR: /home/dimbaba/moses/scripts
>> Asking moses for feature names and values from /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
>> Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4 -inputtype 0 -show-weights > ./features.list
>> MERT starting values and ranges for random generation:
>> d = 0.600 ( 0.00 .. 1.00)
>> lm = 0.250 ( 0.00 .. 1.00)
>> lm = 0.250 ( 0.00 .. 1.00)
>> w = -1.000 ( 0.00 .. 1.00)
>> tm = 0.200 ( 0.00 .. 1.00)
>> tm = 0.200 ( 0.00 .. 1.00)
>> tm = 0.200 ( 0.00 .. 1.00)
>> tm = 0.200 ( 0.00 .. 1.00)
>> tm = 0.200 ( 0.00 .. 1.00)
>> Saved: ./run1.moses.ini
>> Normalizing lambdas: 0.600000 0.250000 0.250000 -1.000000 0.200000 0.200000 0.200000 0.200000 0.200000
>> DECODER_CFG = -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516 0.064516 0.064516 0.064516 0.064516
>> Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4 -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100 -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1 > run1.out
>> Translating line 0 in thread id 140471666632448
>> Check (*contextFactor[count-1])[factorType] != NULL failed in moses/src/LM/SRI.cpp:155
>> sh: line 1: 1648 Aborted (core dumped) /home/dimbaba/moses/dist/bin/moses -v 0 -config /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4 -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100 -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1 > run1.out
>> Exit code: 134
>> The decoder died. CONFIG WAS -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516 0.064516 0.064516 0.064516 0.064516
>> cp: cannot stat '/home/dimbaba/mosesFactored/experiment/tuning/tmp.4/moses.ini': No such file or directory
>>
>>
>> ...and this is my configuration file:
>>
>>
>> ################################################
>> ### CONFIGURATION FILE FOR AN SMT EXPERIMENT ### 
>> ################################################
>>
>> [GENERAL]
>>
>> ### directory in which experiment is run
>> #
>> working-dir = /home/dimbaba/mosesFactored/experiment
>>
>> # specification of the language pair
>> input-extension = de
>> output-extension = el
>> pair-extension = de-el
>>
>> ### directories that contain tools and data
>> #
>> # moses
>> moses-src-dir = /home/dimbaba/moses
>> #
>> # moses binaries
>> moses-bin-dir = $moses-src-dir/dist/bin
>> #
>> # moses scripts
>> moses-script-dir = $moses-src-dir/scripts
>> #
>> # srilm
>> srilm-dir = /home/dimbaba/srilm/bin/i686-m64
>> #
>> # irstlm
>> #irstlm-dir = $moses-src-dir/irstlm/bin
>> #
>> # randlm
>> #randlm-dir = $moses-src-dir/randlm/bin
>> #
>> # data
>> wmt12-data = /home/dimbaba/aligned/el-de
>>
>> ### basic tools
>> #
>> # moses decoder
>> decoder = $moses-bin-dir/moses
>>
>> # conversion of phrase table into binary on-disk format 
>> #ttable-binarizer = $moses-bin-dir/processPhraseTable
>>
>> # conversion of rule table into binary on-disk format 
>> ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
>>
>> # tokenizers - comment out if all your data is already tokenized
>> input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension"
>> output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"
>>
>> # truecasers - comment out if you do not use the truecaser 
>> input-truecaser = $moses-script-dir/recaser/truecase.perl
>> output-truecaser = $moses-script-dir/recaser/truecase.perl
>> detruecaser = $moses-script-dir/recaser/detruecase.perl
>>
>> ### generic parallelizer for cluster and multi-core machines
>> # you may specify a script that allows the parallel execution
>> # parallizable steps (see meta file). you also need specify
>> # the number of jobs (cluster) or cores (multicore)
>> #
>> #generic-parallelizer = $moses-script-dir/ems/support/generic-parallelizer.perl
>> #generic-parallelizer = $moses-script-dir/ems/support/generic-multicore-parallelizer.perl
>>
>> ### cluster settings (if run on a cluster machine)
>> # number of jobs to be submitted in parallel
>> #
>> #jobs = 10
>>
>> # arguments to qsub when scheduling a job
>> #qsub-settings = ""
>>
>> # project for priviledges and usage accounting
>> #qsub-project = iccs_smt
>>
>> # memory and time
>> #qsub-memory = 4
>> #qsub-hours = 48
>>
>> ### multi-core settings
>> # when the generic parallelizer is used, the number of cores
>> # specified here
>> cores = 4
>>
>> #################################################################
>> # PARALLEL CORPUS PREPARATION:
>> # create a tokenized, sentence-aligned corpus, ready for training
>>
>> [CORPUS]
>>
>> ### long sentences are filtered out, since they slow down GIZA++
>> # and are a less reliable source of data. set here the maximum
>> # length of a sentence
>> #
>> max-sentence-length = 100
>>
>> [CORPUS:europarl] IGNORE
>>
>> ### command to run to get raw corpus files
>> #
>> # get-corpus-script =
>>
>> ### raw corpus files (untokenized, but sentence aligned)
>> #
>> raw-stem = $wmt12-data/training/training.clean10
>>
>> ### tokenized corpus files (may contain long sentences)
>> #
>> #tokenized-stem =
>>
>> ### if sentence filtering should be skipped,
>> # point to the clean training data
>> #
>> #clean-stem =
>>
>> ### if corpus preparation should be skipped,
>> # point to the prepared training data
>> #
>> #lowercased-stem =
>>
>> [CORPUS:nc]
>> raw-stem = $wmt12-data/training/training.clean10
>>
>> [CORPUS:un] IGNORE
>> raw-stem = $wmt12-data/training/training.clean10
>>
>> #################################################################
>> # LANGUAGE MODEL TRAINING
>>
>> [LM]
>>
>> ### tool to be used for language model training
>> # srilm
>> lm-training = $srilm-dir/ngram-count
>> settings = ""
>>
>> # irstlm
>> #lm-training = "$moses-script-dir/generic/trainlm-irst.perl -cores $cores -irst-dir $irstlm-dir -temp-dir $working-dir/lm"
>> #settings = ""
>>
>> # order of the language model
>> order = 3
>>
>> ### tool to be used for training randomized language model from scratch
>> # (more commonly, a SRILM is trained)
>> #
>> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>
>> ### script to use for binary table format for irstlm or kenlm
>> # (default: no binarization)
>>
>> # irstlm
>> #lm-binarizer = $irstlm-dir/compile-lm
>>
>> # kenlm, also set type to 8
>> #lm-binarizer = $moses-bin-dir/build_binary
>> #type = 8
>>
>> ### script to create quantized language model format (irstlm)
>> # (default: no quantization)
>> #
>> #lm-quantizer = $irstlm-dir/quantize-lm
>>
>> ### script to use for converting into randomized table format
>> # (default: no randomization)
>> #
>> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>
>> ### each language model to be used has its own section here
>>
>> [LM:europarl] IGNORE
>>
>> ### command to run to get raw corpus files
>> #
>> #get-corpus-script = ""
>>
>> ### raw corpus (untokenized)
>> #
>> raw-corpus = $wmt12-data/training/training.clean.$output-extension
>>
>> ### tokenized corpus files (may contain long sentences)
>> #
>> #tokenized-corpus =
>>
>> ### if corpus preparation should be skipped,
>> # point to the prepared language model
>> #
>> #lm =
>>
>> [LM:nc]
>> raw-corpus = $wmt12-data/training/training.clean10.$output-extension
>>
>> [LM:un] IGNORE
>> raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension
>>
>> [LM:news] IGNORE
>> raw-corpus = $wmt12-data/training/news.$output-extension.shuffled
>>
>> [LM:nc=stem]
>> factors = "stem"
>> order = 3
>> settings = ""
>> raw-corpus = $wmt12-data/training/training.clean.$output-extension
>>
>> #################################################################
>> # INTERPOLATING LANGUAGE MODELS
>>
>> [INTERPOLATED-LM] IGNORE
>>
>> # if multiple language models are used, these may be combined
>> # by optimizing perplexity on a tuning set
>> # see, for instance [Koehn and Schwenk, IJCNLP 2008]
>>
>> ### script to interpolate language models
>> # if commented out, no interpolation is performed
>> #
>> script = $moses-script-dir/ems/support/interpolate-lm.perl
>>
>> ### tuning set
>> # you may use the same set that is used for mert tuning (reference set)
>> #
>> tuning-sgm = $wmt12-data/dev/newstest2010-ref.$output-extension.sgm
>> #raw-tuning =
>> #tokenized-tuning =
>> #factored-tuning =
>> #lowercased-tuning =
>> #split-tuning =
>>
>> ### group language models for hierarchical interpolation
>> # (flat interpolation is limited to 10 language models)
>> #group = "first,second fourth,fifth"
>>
>> ### script to use for binary table format for irstlm or kenlm
>> # (default: no binarization)
>>
>> # irstlm
>> #lm-binarizer = $irstlm-dir/compile-lm
>>
>> # kenlm, also set type to 8
>> #lm-binarizer = $moses-bin-dir/build_binary
>> #type = 8
>>
>> ### script to create quantized language model format (irstlm)
>> # (default: no quantization)
>> #
>> #lm-quantizer = $irstlm-dir/quantize-lm
>>
>> ### script to use for converting into randomized table format
>> # (default: no randomization)
>> #
>> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>
>> #################################################################
>> # FACTOR DEFINITION
>>
>> [INPUT-FACTOR]
>>
>> # also used for output factors
>> temp-dir = $working-dir/training/factor
>>
>> [INPUT-FACTOR:stem]
>>
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>> ### script that generates this factor
>> #
>> #mxpost = /home/pkoehn/bin/mxpost
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>>
>> [OUTPUT-FACTOR:stem]
>>
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>> ### script that generates this factor
>> #
>> #mxpost = /home/pkoehn/bin/mxpost
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>>
>> #################################################################
>> # TRANSLATION MODEL TRAINING
>>
>> [TRAINING]
>>
>> ### training script to be used: either a legacy script or
>> # current moses training script (default)
>> #
>> script = $moses-script-dir/training/train-model.perl
>>
>> ### general options
>> # these are options that are passed on to train-model.perl, for instance
>> # * "-mgiza -mgiza-cpus 8" to use mgiza instead of giza
>> # * "-sort-buffer-size 8G" to reduce on-disk sorting
>> #
>> #training-options = ""
>>
>> ### factored training: specify here which factors used
>> # if none specified, single factor training is assumed
>> # (one translation step, surface to surface)
>> #
>> input-factors = word stem
>> output-factors = word stem
>> alignment-factors = "stem -> stem"
>> translation-factors = "word -> word"
>> reordering-factors = "word -> word"
>> #generation-factors =
>> decoding-steps = "t0"
>>
>> ### parallelization of data preparation step
>> # the two directions of the data preparation can be run in parallel
>> # comment out if not needed
>> #
>> parallel = yes
>>
>> ### pre-computation for giza++
>> # giza++ has a more efficient data structure that needs to be
>> # initialized with snt2cooc. if run in parallel, this may reduces
>> # memory requirements. set here the number of parts
>> #
>> #run-giza-in-parts = 5
>>
>> ### symmetrization method to obtain word alignments from giza output
>> # (commonly used: grow-diag-final-and)
>> #
>> alignment-symmetrization-method = grow-diag-final-and
>>
>> ### use of berkeley aligner for word alignment
>> #
>> #use-berkeley = true
>> #alignment-symmetrization-method = berkeley
>> #berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
>> #berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
>> #berkeley-jar = /your/path/to/berkeleyaligner-1.1/berkeleyaligner.jar
>> #berkeley-java-options = "-server -mx30000m -ea"
>> #berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
>> #berkeley-process-options = "-EMWordAligner.numThreads 8"
>> #berkeley-posterior = 0.5
>>
>> ### if word alignment should be skipped,
>> # point to word alignment files
>> #
>> #word-alignment = $working-dir/model/aligned.1
>>
>> ### create a bilingual concordancer for the model
>> #
>> #biconcor = $moses-script-dir/ems/biconcor/biconcor
>>
>> ### lexicalized reordering: specify orientation type
>> # (default: only distance-based reordering model)
>> #
>> lexicalized-reordering = msd-bidirectional-fe
>>
>> ### hierarchical rule set
>> #
>> hierarchical-rule-set = true
>>
>> ### settings for rule extraction
>> #
>> #extract-settings = ""
>>
>> ### unknown word labels (target syntax only)
>> # enables use of unknown word labels during decoding
>> # label file is generated during rule extraction
>> #
>> #use-unknown-word-labels = true
>>
>> ### if phrase extraction should be skipped,
>> # point to stem for extract files
>> #
>> # extracted-phrases =
>>
>> ### settings for rule scoring
>> #
>> score-settings = "--GoodTuring"
>>
>> ### include word alignment in phrase table
>> #
>> #include-word-alignment-in-rules = yes
>>
>> ### if phrase table training should be skipped,
>> # point to phrase translation table
>> #
>> # phrase-translation-table =
>>
>> ### if reordering table training should be skipped,
>> # point to reordering table
>> #
>> # reordering-table =
>>
>> ### if training should be skipped,
>> # point to a configuration file that contains
>> # pointers to all relevant model files
>> #
>> #config-with-reused-weights =
>>
>> #####################################################
>> ### TUNING: finding good weights for model components
>>
>> [TUNING]
>>
>> ### instead of tuning with this setting, old weights may be recycled
>> # specify here an old configuration file with matching weights
>> #
>> #weight-config = $working-dir/tuning/moses.filtered.ini.1
>>
>> ### tuning script to be used
>> #
>> tuning-script = $moses-script-dir/training/mert-moses.pl
>> tuning-settings = "-mertdir $moses-bin-dir --filtercmd '$moses-script-dir/training/filter-model-given-input.pl'"
>>
>> ### specify the corpus used for tuning
>> # it should contain 1000s of sentences
>> #
>> #input-sgm =
>> raw-input = $wmt12-data/tuning/tuning.clean.$input-extension
>> #tokenized-input =
>> #factorized-input =
>> #input =
>> #
>> #reference-sgm =
>> raw-reference = $wmt12-data/tuning/tuning.clean.$output-extension
>> #tokenized-reference =
>> #factorized-reference =
>> #reference =
>>
>> ### size of n-best list used (typically 100)
>> #
>> nbest = 100
>>
>> ### ranges for weights for random initialization
>> # if not specified, the tuning script will use generic ranges
>> # it is not clear, if this matters
>> #
>> # lambda =
>>
>> ### additional flags for the filter script
>> #
>> #filter-settings = "-Binarizer CreateOnDiskPt 1 1 5 100 2 -Hierarchical"
>>
>> ### additional flags for the decoder
>> #
>> decoder-settings = ""
>>
>> ### if tuning should be skipped, specify this here
>> # and also point to a configuration file that contains
>> # pointers to all relevant model files
>> #
>> #config =
>>
>> #########################################################
>> ## RECASER: restore case, this part only trains the model
>>
>> [RECASING]
>>
>> #decoder = $moses-bin-dir/moses
>>
>> ### training data
>> # raw input needs to be still tokenized,
>> # also also tokenized input may be specified
>> #
>> #tokenized = [LM:europarl:tokenized-corpus]
>>
>> # recase-config =
>>
>> #lm-training = $srilm-dir/ngram-count
>>
>> #######################################################
>> ## TRUECASER: train model to truecase corpora and input
>>
>> [TRUECASER]
>>
>> ### script to train truecaser models
>> #
>> trainer = $moses-script-dir/recaser/train-truecaser.perl
>>
>> ### training data
>> # data on which truecaser is trained
>> # if no training data is specified, parallel corpus is used
>> #
>> # raw-stem =
>> # tokenized-stem =
>>
>> ### trained model
>> #
>> # truecase-model =
>>
>> ######################################################################
>> ## EVALUATION: translating a test set using the tuned system and score it
>>
>> [EVALUATION]
>>
>> ### number of jobs (if parallel execution on cluster)
>> #
>> #jobs = 10
>>
>> ### additional flags for the filter script
>> #
>> #filter-settings = ""
>>
>> ### additional decoder settings
>> # switches for the Moses decoder
>> # common choices:
>> # "-threads N" for multi-threading
>> # "-mbr" for MBR decoding
>> # "-drop-unknown" for dropping unknown source words
>> # "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000" for cube pruning
>> #
>> decoder-settings = "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000"
>>
>> ### specify size of n-best list, if produced
>> #
>> #nbest = 100
>>
>> ### multiple reference translations
>> #
>> #multiref = yes
>>
>> ### prepare system output for scoring
>> # this may include detokenization and wrapping output in sgm
>> # (needed for nist-bleu, ter, meteor)
>> #
>> detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"
>> #recaser = $moses-script-dir/recaser/recase.perl
>> wrapping-script = "$moses-script-dir/ems/support/wrap-xml.perl $output-extension"
>> #output-sgm =
>>
>> ### BLEU
>> #
>> nist-bleu = $moses-script-dir/generic/mteval-v13a.pl
>> nist-bleu-c = "$moses-script-dir/generic/mteval-v13a.pl -c"
>> #multi-bleu = $moses-script-dir/generic/multi-bleu.perl
>> #ibm-bleu =
>>
>> ### TER: translation error rate (BBN metric) based on edit distance
>> # not yet integrated
>> #
>> # ter =
>>
>> ### METEOR: gives credit to stem / worknet synonym matches
>> # not yet integrated
>> #
>> # meteor =
>>
>> ### Analysis: carry out various forms of analysis on the output
>> #
>> analysis = $moses-script-dir/ems/support/analysis.perl
>> #
>> # also report on input coverage
>> analyze-coverage = yes
>> #
>> # also report on phrase mappings used
>> report-segmentation = yes
>> #
>> # report precision of translations for each input word, broken down by
>> # count of input word in corpus and model
>> #report-precision-by-coverage = yes
>> #
>> # further precision breakdown by factor
>> #precision-by-coverage-factor = pos
>>
>> [EVALUATION:newstest2011]
>>
>> ### input data
>> #
>> #input-sgm = "$wmt12-data/$input-extension-test.txt"
>> #raw-input = $wmt12-data/$input-extension-test.txt
>> tokenized-input = "$wmt12-data/de-test.txt"
>> # factorized-input =
>> #input = $wmt12-data/$input-extension-test.txt
>>
>> ### reference data
>> #
>> #reference-sgm = "$wmt12-data/$output-extension-test.txt"
>> #raw-reference ="wmt12-data/$output-extension -test.txt
>> tokenized-reference = "$wmt12-data/el-test.txt"
>> #reference = $wmt12-data/el-test.txt
>>
>> ### analysis settings
>> # may contain any of the general evaluation analysis settings
>> # specific setting: base coverage statistics on earlier run
>> #
>> #precision-by-coverage-base = $working-dir/evaluation/test.analysis.5
>>
>> ### wrapping frame
>> # for nist-bleu and other scoring scripts, the output needs to be wrapped
>> # in sgm markup (typically like the input sgm)
>> #
>> wrapping-frame = $tokenized-input
>>
>> ##########################################
>> ### REPORTING: summarize evaluation scores
>>
>> [REPORTING]
>>
>> ### currently no parameters for reporting section
>>
>> Thank you,
>>
>> Dimitris Babaniotis
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
Hi, thank you for your answer.

I fixed the problem you mentioned, but tuning still fails.

I investigated further and found that the error occurs when the decoder tries to
translate a sentence.
The problem occurs with or without EMS.

Dimitris



