On 28/05/2012 10:01 PM, Philipp Koehn wrote:
> Hi,
>
> there is a problem here:
>
> # conversion of phrase table into binary on-disk format
> #ttable-binarizer = $moses-bin-dir/processPhraseTable
>
> # conversion of rule table into binary on-disk format
> ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
>
> You are using the ttable binarizer for the hierarchical/syntax model,
> but you use a phrase-based model.
>
> -phi
>
> On Sun, May 27, 2012 at 11:45 PM, Dimitris Babaniotis
> <[email protected]> wrote:
>> Hello, I'm trying to run experiments with EMS but the process stops on
>> tuning:tune.
>>
>> Here is the TUNING_tune.stderr file:
>>
>> main::create_extractor_script() called too early to check prototype at
>> /home/dimbaba/moses/scripts/training/mert-moses.pl line 674.
>> Using SCRIPTS_ROOTDIR: /home/dimbaba/moses/scripts
>> Asking moses for feature names and values from
>> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
>> Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config
>> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
>> -inputtype 0 -show-weights > ./features.list
>> MERT starting values and ranges for random generation:
>>   d = 0.600 ( 0.00 .. 1.00)
>>   lm = 0.250 ( 0.00 .. 1.00)
>>   lm = 0.250 ( 0.00 .. 1.00)
>>   w = -1.000 ( 0.00 .. 1.00)
>>   tm = 0.200 ( 0.00 .. 1.00)
>>   tm = 0.200 ( 0.00 .. 1.00)
>>   tm = 0.200 ( 0.00 .. 1.00)
>>   tm = 0.200 ( 0.00 .. 1.00)
>>   tm = 0.200 ( 0.00 .. 1.00)
>> Saved: ./run1.moses.ini
>> Normalizing lambdas: 0.600000 0.250000 0.250000 -1.000000 0.200000
>> 0.200000 0.200000 0.200000 0.200000
>> DECODER_CFG = -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
>> 0.064516 0.064516 0.064516 0.064516
>> Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config
>> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
>> -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
>> 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100
>> -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1
>> > run1.out
>> Translating line 0 in thread id 140471666632448
>> Check (*contextFactor[count-1])[factorType] != NULL failed in
>> moses/src/LM/SRI.cpp:155
>> sh: line 1: 1648 Aborted (core dumped) /home/dimbaba/moses/dist/bin/moses
>> -v 0 -config
>> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
>> -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
>> 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100
>> -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1
>> > run1.out
>> Exit code: 134
>> The decoder died.
>> CONFIG WAS -w -0.322581 -lm 0.080645 0.080645 -d 0.193548
>> -tm 0.064516 0.064516 0.064516 0.064516 0.064516
>> cp: cannot stat
>> "/home/dimbaba/mosesFactored/experiment/tuning/tmp.4/moses.ini": No such
>> file or directory
>>
>>
>> ...and this is my configuration file:
>>
>>
>> ################################################
>> ### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###
>> ################################################
>>
>> [GENERAL]
>>
>> ### directory in which experiment is run
>> #
>> working-dir = /home/dimbaba/mosesFactored/experiment
>>
>> # specification of the language pair
>> input-extension = de
>> output-extension = el
>> pair-extension = de-el
>>
>> ### directories that contain tools and data
>> #
>> # moses
>> moses-src-dir = /home/dimbaba/moses
>> #
>> # moses binaries
>> moses-bin-dir = $moses-src-dir/dist/bin
>> #
>> # moses scripts
>> moses-script-dir = $moses-src-dir/scripts
>> #
>> # srilm
>> srilm-dir = /home/dimbaba/srilm/bin/i686-m64
>> #
>> # irstlm
>> #irstlm-dir = $moses-src-dir/irstlm/bin
>> #
>> # randlm
>> #randlm-dir = $moses-src-dir/randlm/bin
>> #
>> # data
>> wmt12-data = /home/dimbaba/aligned/el-de
>>
>> ### basic tools
>> #
>> # moses decoder
>> decoder = $moses-bin-dir/moses
>>
>> # conversion of phrase table into binary on-disk format
>> #ttable-binarizer = $moses-bin-dir/processPhraseTable
>>
>> # conversion of rule table into binary on-disk format
>> ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
>>
>> # tokenizers - comment out if all your data is already tokenized
>> input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension"
>> output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"
>>
>> # truecasers - comment out if you do not use the truecaser
>> input-truecaser = $moses-script-dir/recaser/truecase.perl
>> output-truecaser = $moses-script-dir/recaser/truecase.perl
>> detruecaser = $moses-script-dir/recaser/detruecase.perl
>>
>> ### generic parallelizer for cluster and multi-core machines
>> # you may specify a script that allows the parallel execution of
>> # parallelizable steps (see meta file). you also need to specify
>> # the number of jobs (cluster) or cores (multicore)
>> #
>> #generic-parallelizer = $moses-script-dir/ems/support/generic-parallelizer.perl
>> #generic-parallelizer = $moses-script-dir/ems/support/generic-multicore-parallelizer.perl
>>
>> ### cluster settings (if run on a cluster machine)
>> # number of jobs to be submitted in parallel
>> #
>> #jobs = 10
>>
>> # arguments to qsub when scheduling a job
>> #qsub-settings = ""
>>
>> # project for privileges and usage accounting
>> #qsub-project = iccs_smt
>>
>> # memory and time
>> #qsub-memory = 4
>> #qsub-hours = 48
>>
>> ### multi-core settings
>> # when the generic parallelizer is used, the number of cores is
>> # specified here
>> cores = 4
>>
>> #################################################################
>> # PARALLEL CORPUS PREPARATION:
>> # create a tokenized, sentence-aligned corpus, ready for training
>>
>> [CORPUS]
>>
>> ### long sentences are filtered out, since they slow down GIZA++
>> # and are a less reliable source of data. set here the maximum
>> # length of a sentence
>> #
>> max-sentence-length = 100
>>
>> [CORPUS:europarl] IGNORE
>>
>> ### command to run to get raw corpus files
>> #
>> # get-corpus-script =
>>
>> ### raw corpus files (untokenized, but sentence aligned)
>> #
>> raw-stem = $wmt12-data/training/training.clean10
>>
>> ### tokenized corpus files (may contain long sentences)
>> #
>> #tokenized-stem =
>>
>> ### if sentence filtering should be skipped,
>> # point to the clean training data
>> #
>> #clean-stem =
>>
>> ### if corpus preparation should be skipped,
>> # point to the prepared training data
>> #
>> #lowercased-stem =
>>
>> [CORPUS:nc]
>> raw-stem = $wmt12-data/training/training.clean10
>>
>> [CORPUS:un] IGNORE
>> raw-stem = $wmt12-data/training/training.clean10
>>
>> #################################################################
>> # LANGUAGE MODEL TRAINING
>>
>> [LM]
>>
>> ### tool to be used for language model training
>> # srilm
>> lm-training = $srilm-dir/ngram-count
>> settings = ""
>>
>> # irstlm
>> #lm-training = "$moses-script-dir/generic/trainlm-irst.perl -cores $cores -irst-dir $irstlm-dir -temp-dir $working-dir/lm"
>> #settings = ""
>>
>> # order of the language model
>> order = 3
>>
>> ### tool to be used for training randomized language model from scratch
>> # (more commonly, a SRILM is trained)
>> #
>> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>
>> ### script to use for binary table format for irstlm or kenlm
>> # (default: no binarization)
>>
>> # irstlm
>> #lm-binarizer = $irstlm-dir/compile-lm
>>
>> # kenlm, also set type to 8
>> #lm-binarizer = $moses-bin-dir/build_binary
>> #type = 8
>>
>> ### script to create quantized language model format (irstlm)
>> # (default: no quantization)
>> #
>> #lm-quantizer = $irstlm-dir/quantize-lm
>>
>> ### script to use for converting into randomized table format
>> # (default: no randomization)
>> #
>> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>
>> ### each language model to be used has its own section here
>>
>> [LM:europarl] IGNORE
>>
>> ### command to run to get raw corpus files
>> #
>> #get-corpus-script = ""
>>
>> ### raw corpus (untokenized)
>> #
>> raw-corpus = $wmt12-data/training/training.clean.$output-extension
>>
>> ### tokenized corpus files (may contain long sentences)
>> #
>> #tokenized-corpus =
>>
>> ### if corpus preparation should be skipped,
>> # point to the prepared language model
>> #
>> #lm =
>>
>> [LM:nc]
>> raw-corpus = $wmt12-data/training/training.clean10.$output-extension
>>
>> [LM:un] IGNORE
>> raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension
>>
>> [LM:news] IGNORE
>> raw-corpus = $wmt12-data/training/news.$output-extension.shuffled
>>
>> [LM:nc=stem]
>> factors = "stem"
>> order = 3
>> settings = ""
>> raw-corpus = $wmt12-data/training/training.clean.$output-extension
>>
>> #################################################################
>> # INTERPOLATING LANGUAGE MODELS
>>
>> [INTERPOLATED-LM] IGNORE
>>
>> # if multiple language models are used, these may be combined
>> # by optimizing perplexity on a tuning set
>> # see, for instance, [Koehn and Schwenk, IJCNLP 2008]
>>
>> ### script to interpolate language models
>> # if commented out, no interpolation is performed
>> #
>> script = $moses-script-dir/ems/support/interpolate-lm.perl
>>
>> ### tuning set
>> # you may use the same set that is used for mert tuning (reference set)
>> #
>> tuning-sgm = $wmt12-data/dev/newstest2010-ref.$output-extension.sgm
>> #raw-tuning =
>> #tokenized-tuning =
>> #factored-tuning =
>> #lowercased-tuning =
>> #split-tuning =
>>
>> ### group language models for hierarchical interpolation
>> # (flat interpolation is limited to 10 language models)
>> #group = "first,second fourth,fifth"
>>
>> ### script to use for binary table format for irstlm or kenlm
>> # (default: no binarization)
>>
>> # irstlm
>> #lm-binarizer = $irstlm-dir/compile-lm
>>
>> # kenlm, also set type to 8
>> #lm-binarizer = $moses-bin-dir/build_binary
>> #type = 8
>>
>> ### script to create quantized language model format (irstlm)
>> # (default: no quantization)
>> #
>> #lm-quantizer = $irstlm-dir/quantize-lm
>>
>> ### script to use for converting into randomized table format
>> # (default: no randomization)
>> #
>> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>
>> #################################################################
>> # FACTOR DEFINITION
>>
>> [INPUT-FACTOR]
>>
>> # also used for output factors
>> temp-dir = $working-dir/training/factor
>>
>> [INPUT-FACTOR:stem]
>>
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>> ### script that generates this factor
>> #
>> #mxpost = /home/pkoehn/bin/mxpost
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>>
>> [OUTPUT-FACTOR:stem]
>>
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>> ### script that generates this factor
>> #
>> #mxpost = /home/pkoehn/bin/mxpost
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>>
>> #################################################################
>> # TRANSLATION MODEL TRAINING
>>
>> [TRAINING]
>>
>> ### training script to be used: either a legacy script or
>> # current moses training script (default)
>> #
>> script = $moses-script-dir/training/train-model.perl
>>
>> ### general options
>> # these are options that are passed on to train-model.perl, for instance
>> # * "-mgiza -mgiza-cpus 8" to use mgiza instead of giza
>> # * "-sort-buffer-size 8G" to reduce on-disk sorting
>> #
>> #training-options = ""
>>
>> ### factored training: specify here which factors are used
>> # if none specified, single-factor training is assumed
>> # (one translation step, surface to surface)
>> #
>> input-factors = word stem
>> output-factors = word stem
>> alignment-factors = "stem -> stem"
>> translation-factors = "word -> word"
>> reordering-factors = "word -> word"
>> #generation-factors =
>> decoding-steps = "t0"
>>
>> ### parallelization of data preparation step
>> # the two directions of the data preparation can be run in parallel
>> # comment out if not needed
>> #
>> parallel = yes
>>
>> ### pre-computation for giza++
>> # giza++ has a more efficient data structure that needs to be
>> # initialized with snt2cooc. if run in parallel, this may reduce
>> # memory requirements. set here the number of parts
>> #
>> #run-giza-in-parts = 5
>>
>> ### symmetrization method to obtain word alignments from giza output
>> # (commonly used: grow-diag-final-and)
>> #
>> alignment-symmetrization-method = grow-diag-final-and
>>
>> ### use of berkeley aligner for word alignment
>> #
>> #use-berkeley = true
>> #alignment-symmetrization-method = berkeley
>> #berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
>> #berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
>> #berkeley-jar = /your/path/to/berkeleyaligner-1.1/berkeleyaligner.jar
>> #berkeley-java-options = "-server -mx30000m -ea"
>> #berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
>> #berkeley-process-options = "-EMWordAligner.numThreads 8"
>> #berkeley-posterior = 0.5
>>
>> ### if word alignment should be skipped,
>> # point to word alignment files
>> #
>> #word-alignment = $working-dir/model/aligned.1
>>
>> ### create a bilingual concordancer for the model
>> #
>> #biconcor = $moses-script-dir/ems/biconcor/biconcor
>>
>> ### lexicalized reordering: specify orientation type
>> # (default: only distance-based reordering model)
>> #
>> lexicalized-reordering = msd-bidirectional-fe
>>
>> ### hierarchical rule set
>> #
>> hierarchical-rule-set = true
>>
>> ### settings for rule extraction
>> #
>> #extract-settings = ""
>>
>> ### unknown word labels (target syntax only)
>> # enables use of unknown word labels during decoding
>> # label file is generated during rule extraction
>> #
>> #use-unknown-word-labels = true
>>
>> ### if phrase extraction should be skipped,
>> # point to stem for extract files
>> #
>> # extracted-phrases =
>>
>> ### settings for rule scoring
>> #
>> score-settings = "--GoodTuring"
>>
>> ### include word alignment in phrase table
>> #
>> #include-word-alignment-in-rules = yes
>>
>> ### if phrase table training should be skipped,
>> # point to phrase translation table
>> #
>> # phrase-translation-table =
>>
>> ### if reordering table training should be skipped,
>> # point to reordering table
>> #
>> # reordering-table =
>>
>> ### if training should be skipped,
>> # point to a configuration file that contains
>> # pointers to all relevant model files
>> #
>> #config-with-reused-weights =
>>
>> #####################################################
>> ### TUNING: finding good weights for model components
>>
>> [TUNING]
>>
>> ### instead of tuning with this setting, old weights may be recycled
>> # specify here an old configuration file with matching weights
>> #
>> #weight-config = $working-dir/tuning/moses.filtered.ini.1
>>
>> ### tuning script to be used
>> #
>> tuning-script = $moses-script-dir/training/mert-moses.pl
>> tuning-settings = "-mertdir $moses-bin-dir --filtercmd '$moses-script-dir/training/filter-model-given-input.pl'"
>>
>> ### specify the corpus used for tuning
>> # it should contain 1000s of sentences
>> #
>> #input-sgm =
>> raw-input = $wmt12-data/tuning/tuning.clean.$input-extension
>> #tokenized-input =
>> #factorized-input =
>> #input =
>> #
>> #reference-sgm =
>> raw-reference = $wmt12-data/tuning/tuning.clean.$output-extension
>> #tokenized-reference =
>> #factorized-reference =
>> #reference =
>>
>> ### size of n-best list used (typically 100)
>> #
>> nbest = 100
>>
>> ### ranges for weights for random initialization
>> # if not specified, the tuning script will use generic ranges
>> # it is not clear if this matters
>> #
>> # lambda =
>>
>> ### additional flags for the filter script
>> #
>> #filter-settings = "-Binarizer CreateOnDiskPt 1 1 5 100 2 -Hierarchical"
>>
>> ### additional flags for the decoder
>> #
>> decoder-settings = ""
>>
>> ### if tuning should be skipped, specify this here
>> # and also point to a configuration file that contains
>> # pointers to all relevant model files
>> #
>> #config =
>>
>> #########################################################
>> ## RECASER: restore case, this part only trains the model
>>
>> [RECASING]
>>
>> #decoder = $moses-bin-dir/moses
>>
>> ### training data
>> # raw input still needs to be tokenized,
>> # also tokenized input may be specified
>> #
>> #tokenized = [LM:europarl:tokenized-corpus]
>>
>> # recase-config =
>>
>> #lm-training = $srilm-dir/ngram-count
>>
>> #######################################################
>> ## TRUECASER: train model to truecase corpora and input
>>
>> [TRUECASER]
>>
>> ### script to train truecaser models
>> #
>> trainer = $moses-script-dir/recaser/train-truecaser.perl
>>
>> ### training data
>> # data on which truecaser is trained
>> # if no training data is specified, parallel corpus is used
>> #
>> # raw-stem =
>> # tokenized-stem =
>>
>> ### trained model
>> #
>> # truecase-model =
>>
>> ######################################################################
>> ## EVALUATION: translating a test set using the tuned system and scoring it
>>
>> [EVALUATION]
>>
>> ### number of jobs (if parallel execution on cluster)
>> #
>> #jobs = 10
>>
>> ### additional flags for the filter script
>> #
>> #filter-settings = ""
>>
>> ### additional decoder settings
>> # switches for the Moses decoder
>> # common choices:
>> #   "-threads N" for multi-threading
>> #   "-mbr" for MBR decoding
>> #   "-drop-unknown" for dropping unknown source words
>> #   "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000" for cube pruning
>> #
>> decoder-settings = "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000"
>>
>> ### specify size of n-best list, if produced
>> #
>> #nbest = 100
>>
>> ### multiple reference translations
>> #
>> #multiref = yes
>>
>> ### prepare system output for scoring
>> # this may include detokenization and wrapping output in sgm
>> # (needed for nist-bleu, ter, meteor)
>> #
>> detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"
>> #recaser = $moses-script-dir/recaser/recase.perl
>> wrapping-script = "$moses-script-dir/ems/support/wrap-xml.perl $output-extension"
>> #output-sgm =
>>
>> ### BLEU
>> #
>> nist-bleu = $moses-script-dir/generic/mteval-v13a.pl
>> nist-bleu-c = "$moses-script-dir/generic/mteval-v13a.pl -c"
>> #multi-bleu = $moses-script-dir/generic/multi-bleu.perl
>> #ibm-bleu =
>>
>> ### TER: translation error rate (BBN metric) based on edit distance
>> # not yet integrated
>> #
>> # ter =
>>
>> ### METEOR: gives credit to stem / WordNet synonym matches
>> # not yet integrated
>> #
>> # meteor =
>>
>> ### Analysis: carry out various forms of analysis on the output
>> #
>> analysis = $moses-script-dir/ems/support/analysis.perl
>> #
>> # also report on input coverage
>> analyze-coverage = yes
>> #
>> # also report on phrase mappings used
>> report-segmentation = yes
>> #
>> # report precision of translations for each input word, broken down by
>> # count of input word in corpus and model
>> #report-precision-by-coverage = yes
>> #
>> # further precision breakdown by factor
>> #precision-by-coverage-factor = pos
>>
>> [EVALUATION:newstest2011]
>>
>> ### input data
>> #
>> #input-sgm = "$wmt12-data/$input-extension-test.txt"
>> #raw-input = $wmt12-data/$input-extension-test.txt
>> tokenized-input = "$wmt12-data/de-test.txt"
>> # factorized-input =
>> #input = $wmt12-data/$input-extension-test.txt
>>
>> ### reference data
>> #
>> #reference-sgm = "$wmt12-data/$output-extension-test.txt"
>> #raw-reference = "$wmt12-data/$output-extension-test.txt"
>> tokenized-reference = "$wmt12-data/el-test.txt"
>> #reference = $wmt12-data/el-test.txt
>>
>> ### analysis settings
>> # may contain any of the general evaluation analysis settings
>> # specific setting: base coverage statistics on earlier run
>> #
>> #precision-by-coverage-base = $working-dir/evaluation/test.analysis.5
>>
>> ### wrapping frame
>> # for nist-bleu and other scoring scripts, the output needs to be wrapped
>> # in sgm markup (typically like the input sgm)
>> #
>> wrapping-frame = $tokenized-input
>>
>> ##########################################
>> ### REPORTING: summarize evaluation scores
>>
>> [REPORTING]
>>
>> ### currently no parameters for reporting section
>>
>> Thank you,
>>
>> Dimitris Babaniotis
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support

Hi,

thank you for your answer,
I fixed the problem that you mentioned, but the error still occurs. Looking into it further, I found that the error happens when the decoder tries to translate a sentence. The problem appears with or without EMS.

Dimitris
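[Editor's note: for a phrase-based system, the fix Philipp points out amounts to swapping which `ttable-binarizer` line is active in the `[GENERAL]` section of the EMS config quoted above. A minimal sketch of the corrected fragment, using the same paths as in that config:]

```ini
# conversion of phrase table into binary on-disk format
# (this is the binarizer for phrase-based models -- enable it)
ttable-binarizer = $moses-bin-dir/processPhraseTable

# conversion of rule table into binary on-disk format
# (only for hierarchical/syntax models -- comment it out here)
#ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
```

Whether other hierarchical settings in the config (such as `hierarchical-rule-set = true` under `[TRAINING]`) also need revisiting depends on which model type is actually intended; the thread does not settle that.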
