Hi, there is a problem here:
# conversion of phrase table into binary on-disk format
#ttable-binarizer = $moses-bin-dir/processPhraseTable

# conversion of rule table into binary on-disk format
ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"

You are using the ttable binarizer for the hierarchical/syntax model, but your model is phrase-based: comment out the CreateOnDisk line and use processPhraseTable instead.

-phi

On Sun, May 27, 2012 at 11:45 PM, Dimitris Babaniotis <[email protected]> wrote:

> Hello, I'm trying to run experiments with EMS, but the process stops on
> tuning:tune.
>
> Here is the TUNING_tune.stderr file:
>
> main::create_extractor_script() called too early to check prototype at
> /home/dimbaba/moses/scripts/training/mert-moses.pl line 674.
> Using SCRIPTS_ROOTDIR: /home/dimbaba/moses/scripts
> Asking moses for feature names and values from
> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
> Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config
> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
> -inputtype 0 -show-weights > ./features.list
> MERT starting values and ranges for random generation:
>   d = 0.600 ( 0.00 .. 1.00)
>   lm = 0.250 ( 0.00 .. 1.00)
>   lm = 0.250 ( 0.00 .. 1.00)
>   w = -1.000 ( 0.00 .. 1.00)
>   tm = 0.200 ( 0.00 .. 1.00)
>   tm = 0.200 ( 0.00 .. 1.00)
>   tm = 0.200 ( 0.00 .. 1.00)
>   tm = 0.200 ( 0.00 .. 1.00)
>   tm = 0.200 ( 0.00 .. 1.00)
> Saved: ./run1.moses.ini
> Normalizing lambdas: 0.600000 0.250000 0.250000 -1.000000 0.200000 0.200000 0.200000 0.200000 0.200000
> DECODER_CFG = -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
> 0.064516 0.064516 0.064516 0.064516
> Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config
> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
> -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
> 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100
> -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1 > run1.out
> Translating line 0 in thread id 140471666632448
> Check (*contextFactor[count-1])[factorType] != NULL failed in
> moses/src/LM/SRI.cpp:155
> sh: line 1: 1648 Aborted (core dumped) /home/dimbaba/moses/dist/bin/moses
> -v 0 -config /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
> -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
> 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100
> -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1 > run1.out
> Exit code: 134
> The decoder died.
> CONFIG WAS -w -0.322581 -lm 0.080645 0.080645 -d 0.193548
> -tm 0.064516 0.064516 0.064516 0.064516 0.064516
> cp: cannot stat
> '/home/dimbaba/mosesFactored/experiment/tuning/tmp.4/moses.ini': No such
> file or directory
>
>
> ...and this is my configuration file:
>
>
> ################################################
> ### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###
> ################################################
>
> [GENERAL]
>
> ### directory in which experiment is run
> #
> working-dir = /home/dimbaba/mosesFactored/experiment
>
> # specification of the language pair
> input-extension = de
> output-extension = el
> pair-extension = de-el
>
> ### directories that contain tools and data
> #
> # moses
> moses-src-dir = /home/dimbaba/moses
> #
> # moses binaries
> moses-bin-dir = $moses-src-dir/dist/bin
> #
> # moses scripts
> moses-script-dir = $moses-src-dir/scripts
> #
> # srilm
> srilm-dir = /home/dimbaba/srilm/bin/i686-m64
> #
> # irstlm
> #irstlm-dir = $moses-src-dir/irstlm/bin
> #
> # randlm
> #randlm-dir = $moses-src-dir/randlm/bin
> #
> # data
> wmt12-data = /home/dimbaba/aligned/el-de
>
> ### basic tools
> #
> # moses decoder
> decoder = $moses-bin-dir/moses
>
> # conversion of phrase table into binary on-disk format
> #ttable-binarizer = $moses-bin-dir/processPhraseTable
>
> # conversion of rule table into binary on-disk format
> ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
>
> # tokenizers - comment out if all your data is already tokenized
> input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension"
> output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"
>
> # truecasers - comment out if you do not use the truecaser
> input-truecaser = $moses-script-dir/recaser/truecase.perl
> output-truecaser = $moses-script-dir/recaser/truecase.perl
> detruecaser = $moses-script-dir/recaser/detruecase.perl
>
> ### generic parallelizer for cluster and multi-core machines
> # you may specify a script that allows the parallel execution of
> # parallelizable steps (see meta file). you also need to specify
> # the number of jobs (cluster) or cores (multicore)
> #
> #generic-parallelizer = $moses-script-dir/ems/support/generic-parallelizer.perl
> #generic-parallelizer = $moses-script-dir/ems/support/generic-multicore-parallelizer.perl
>
> ### cluster settings (if run on a cluster machine)
> # number of jobs to be submitted in parallel
> #
> #jobs = 10
>
> # arguments to qsub when scheduling a job
> #qsub-settings = ""
>
> # project for privileges and usage accounting
> #qsub-project = iccs_smt
>
> # memory and time
> #qsub-memory = 4
> #qsub-hours = 48
>
> ### multi-core settings
> # when the generic parallelizer is used, the number of cores is
> # specified here
> cores = 4
>
> #################################################################
> # PARALLEL CORPUS PREPARATION:
> # create a tokenized, sentence-aligned corpus, ready for training
>
> [CORPUS]
>
> ### long sentences are filtered out, since they slow down GIZA++
> # and are a less reliable source of data.
> # set here the maximum
> # length of a sentence
> #
> max-sentence-length = 100
>
> [CORPUS:europarl] IGNORE
>
> ### command to run to get raw corpus files
> #
> # get-corpus-script =
>
> ### raw corpus files (untokenized, but sentence aligned)
> #
> raw-stem = $wmt12-data/training/training.clean10
>
> ### tokenized corpus files (may contain long sentences)
> #
> #tokenized-stem =
>
> ### if sentence filtering should be skipped,
> # point to the clean training data
> #
> #clean-stem =
>
> ### if corpus preparation should be skipped,
> # point to the prepared training data
> #
> #lowercased-stem =
>
> [CORPUS:nc]
> raw-stem = $wmt12-data/training/training.clean10
>
> [CORPUS:un] IGNORE
> raw-stem = $wmt12-data/training/training.clean10
>
> #################################################################
> # LANGUAGE MODEL TRAINING
>
> [LM]
>
> ### tool to be used for language model training
> # srilm
> lm-training = $srilm-dir/ngram-count
> settings = ""
>
> # irstlm
> #lm-training = "$moses-script-dir/generic/trainlm-irst.perl -cores $cores -irst-dir $irstlm-dir -temp-dir $working-dir/lm"
> #settings = ""
>
> # order of the language model
> order = 3
>
> ### tool to be used for training randomized language model from scratch
> # (more commonly, a SRILM is trained)
> #
> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> ### script to use for binary table format for irstlm or kenlm
> # (default: no binarization)
>
> # irstlm
> #lm-binarizer = $irstlm-dir/compile-lm
>
> # kenlm, also set type to 8
> #lm-binarizer = $moses-bin-dir/build_binary
> #type = 8
>
> ### script to create quantized language model format (irstlm)
> # (default: no quantization)
> #
> #lm-quantizer = $irstlm-dir/quantize-lm
>
> ### script to use for converting into randomized table format
> # (default: no randomization)
> #
> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> ### each language model to be used has its own section here
>
> [LM:europarl] IGNORE
>
> ### command to run to get raw corpus files
> #
> #get-corpus-script = ""
>
> ### raw corpus (untokenized)
> #
> raw-corpus = $wmt12-data/training/training.clean.$output-extension
>
> ### tokenized corpus files (may contain long sentences)
> #
> #tokenized-corpus =
>
> ### if corpus preparation should be skipped,
> # point to the prepared language model
> #
> #lm =
>
> [LM:nc]
> raw-corpus = $wmt12-data/training/training.clean10.$output-extension
>
> [LM:un] IGNORE
> raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension
>
> [LM:news] IGNORE
> raw-corpus = $wmt12-data/training/news.$output-extension.shuffled
>
> [LM:nc=stem]
> factors = "stem"
> order = 3
> settings = ""
> raw-corpus = $wmt12-data/training/training.clean.$output-extension
>
> #################################################################
> # INTERPOLATING LANGUAGE MODELS
>
> [INTERPOLATED-LM] IGNORE
>
> # if multiple language models are used, these may be combined
> # by optimizing perplexity on a tuning set
> # see, for instance, [Koehn and Schwenk, IJCNLP 2008]
>
> ### script to interpolate language models
> # if commented out, no interpolation is performed
> #
> script = $moses-script-dir/ems/support/interpolate-lm.perl
>
> ### tuning set
> # you may use the same set that is used for mert tuning (reference set)
> #
> tuning-sgm = $wmt12-data/dev/newstest2010-ref.$output-extension.sgm
> #raw-tuning =
> #tokenized-tuning =
> #factored-tuning =
> #lowercased-tuning =
> #split-tuning =
>
> ### group language models for hierarchical interpolation
> # (flat interpolation is limited to 10 language models)
> #group = "first,second fourth,fifth"
>
> ### script to use for binary table format for irstlm or kenlm
> # (default: no binarization)
>
> # irstlm
> #lm-binarizer = $irstlm-dir/compile-lm
>
> # kenlm, also set type to 8
> #lm-binarizer = $moses-bin-dir/build_binary
> #type = 8
>
> ### script to create quantized language model format (irstlm)
> # (default: no quantization)
> #
> #lm-quantizer = $irstlm-dir/quantize-lm
>
> ### script to use for converting into randomized table format
> # (default: no randomization)
> #
> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> #################################################################
> # FACTOR DEFINITION
>
> [INPUT-FACTOR]
>
> # also used for output factors
> temp-dir = $working-dir/training/factor
>
> [INPUT-FACTOR:stem]
>
> ### script that generates this factor
> #
> #mxpost = /home/pkoehn/bin/mxpost
> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>
> [OUTPUT-FACTOR:stem]
>
> ### script that generates this factor
> #
> #mxpost = /home/pkoehn/bin/mxpost
> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>
> #################################################################
> # TRANSLATION MODEL TRAINING
>
> [TRAINING]
>
> ### training script to be used: either a legacy script or
> # the current moses training script (default)
> #
> script = $moses-script-dir/training/train-model.perl
>
> ### general options
> # these are options that are passed on to train-model.perl, for instance
> # * "-mgiza -mgiza-cpus 8" to use mgiza instead of giza
> # * "-sort-buffer-size 8G" to reduce on-disk sorting
> #
> #training-options = ""
>
> ### factored training: specify here which factors are used
> # if none specified, single-factor training is assumed
> # (one translation step, surface to surface)
> #
> input-factors = word stem
> output-factors = word stem
> alignment-factors = "stem -> stem"
> translation-factors = "word -> word"
> reordering-factors = "word -> word"
> #generation-factors =
> decoding-steps = "t0"
>
> ### parallelization of data preparation step
> # the two directions of the data preparation can be run in parallel
> # comment out if not needed
> #
> parallel = yes
>
> ### pre-computation for giza++
> # giza++ has a more efficient data structure that needs to be
> # initialized with snt2cooc. if run in parallel, this may reduce
> # memory requirements. set here the number of parts
> #
> #run-giza-in-parts = 5
>
> ### symmetrization method to obtain word alignments from giza output
> # (commonly used: grow-diag-final-and)
> #
> alignment-symmetrization-method = grow-diag-final-and
>
> ### use of berkeley aligner for word alignment
> #
> #use-berkeley = true
> #alignment-symmetrization-method = berkeley
> #berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
> #berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
> #berkeley-jar = /your/path/to/berkeleyaligner-1.1/berkeleyaligner.jar
> #berkeley-java-options = "-server -mx30000m -ea"
> #berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
> #berkeley-process-options = "-EMWordAligner.numThreads 8"
> #berkeley-posterior = 0.5
>
> ### if word alignment should be skipped,
> # point to word alignment files
> #
> #word-alignment = $working-dir/model/aligned.1
>
> ### create a bilingual concordancer for the model
> #
> #biconcor = $moses-script-dir/ems/biconcor/biconcor
>
> ### lexicalized reordering: specify orientation type
> # (default: only distance-based reordering model)
> #
> lexicalized-reordering = msd-bidirectional-fe
>
> ### hierarchical rule set
> #
> hierarchical-rule-set = true
>
> ### settings for rule extraction
> #
> #extract-settings = ""
>
> ### unknown word labels (target syntax only)
> # enables use of unknown word labels during decoding
> # label file is generated during rule extraction
> #
> #use-unknown-word-labels = true
>
> ### if phrase extraction should be skipped,
> # point to stem for extract files
> #
> # extracted-phrases =
>
> ### settings for rule scoring
> #
> score-settings = "--GoodTuring"
>
> ### include word alignment in phrase table
> #
> #include-word-alignment-in-rules = yes
>
> ### if phrase table training should be skipped,
> # point to phrase translation table
> #
> # phrase-translation-table =
>
> ### if reordering table training should be skipped,
> # point to reordering table
> #
> # reordering-table =
>
> ### if training should be skipped,
> # point to a configuration file that contains
> # pointers to all relevant model files
> #
> #config-with-reused-weights =
>
> #####################################################
> ### TUNING: finding good weights for model components
>
> [TUNING]
>
> ### instead of tuning with this setting, old weights may be recycled
> # specify here an old configuration file with matching weights
> #
> #weight-config = $working-dir/tuning/moses.filtered.ini.1
>
> ### tuning script to be used
> #
> tuning-script = $moses-script-dir/training/mert-moses.pl
> tuning-settings = "-mertdir $moses-bin-dir --filtercmd '$moses-script-dir/training/filter-model-given-input.pl'"
>
> ### specify the corpus used for tuning
> # it should contain 1000s of sentences
> #
> #input-sgm =
> raw-input = $wmt12-data/tuning/tuning.clean.$input-extension
> #tokenized-input =
> #factorized-input =
> #input =
> #
> #reference-sgm =
> raw-reference = $wmt12-data/tuning/tuning.clean.$output-extension
> #tokenized-reference =
> #factorized-reference =
> #reference =
>
> ### size of n-best list used (typically 100)
> #
> nbest = 100
>
> ### ranges for weights for random initialization
> # if not specified, the tuning script will use generic ranges
> # it is not clear if this matters
> #
> # lambda =
>
> ### additional flags for the filter script
> #
> #filter-settings = "-Binarizer CreateOnDiskPt 1 1 5 100 2 -Hierarchical"
>
> ### additional flags for the decoder
> #
> decoder-settings = ""
>
> ### if tuning should be skipped, specify this here
> # and also point to a configuration file that contains
> # pointers to all relevant model files
> #
> #config =
>
> #########################################################
> ## RECASER: restore case, this part only trains the model
>
> [RECASING]
>
> #decoder = $moses-bin-dir/moses
>
> ### training data
> # raw input still needs to be tokenized;
> # alternatively, tokenized input may be specified
> #
> #tokenized = [LM:europarl:tokenized-corpus]
>
> # recase-config =
>
> #lm-training = $srilm-dir/ngram-count
>
> #######################################################
> ## TRUECASER: train model to truecase corpora and input
>
> [TRUECASER]
>
> ### script to train truecaser models
> #
> trainer = $moses-script-dir/recaser/train-truecaser.perl
>
> ### training data
> # data on which truecaser is trained
> # if no training data is specified, parallel corpus is used
> #
> # raw-stem =
> # tokenized-stem =
>
> ### trained model
> #
> # truecase-model =
>
> ######################################################################
> ## EVALUATION: translating a test set using the tuned system and scoring it
>
> [EVALUATION]
>
> ### number of jobs (if parallel execution on cluster)
> #
> #jobs = 10
>
> ### additional flags for the filter script
> #
> #filter-settings = ""
>
> ### additional decoder settings
> # switches for the Moses decoder
> # common choices:
> # "-threads N" for multi-threading
> # "-mbr" for MBR decoding
> # "-drop-unknown" for dropping unknown source words
> # "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000" for cube pruning
> #
> decoder-settings = "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000"
>
> ### specify size of n-best list, if produced
> #
> #nbest = 100
>
> ### multiple reference translations
> #
> #multiref = yes
>
> ### prepare system output for scoring
> # this may include detokenization and wrapping output in sgm
> # (needed for nist-bleu, ter, meteor)
> #
> detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"
> #recaser = $moses-script-dir/recaser/recase.perl
> wrapping-script = "$moses-script-dir/ems/support/wrap-xml.perl $output-extension"
> #output-sgm =
>
> ### BLEU
> #
> nist-bleu = $moses-script-dir/generic/mteval-v13a.pl
> nist-bleu-c = "$moses-script-dir/generic/mteval-v13a.pl -c"
> #multi-bleu = $moses-script-dir/generic/multi-bleu.perl
> #ibm-bleu =
>
> ### TER: translation error rate (BBN metric) based on edit distance
> # not yet integrated
> #
> # ter =
>
> ### METEOR: gives credit to stem / WordNet synonym matches
> # not yet integrated
> #
> # meteor =
>
> ### Analysis: carry out various forms of analysis on the output
> #
> analysis = $moses-script-dir/ems/support/analysis.perl
> #
> # also report on input coverage
> analyze-coverage = yes
> #
> # also report on phrase mappings used
> report-segmentation = yes
> #
> # report precision of translations for each input word, broken down by
> # count of input word in corpus and model
> #report-precision-by-coverage = yes
> #
> # further precision breakdown by factor
> #precision-by-coverage-factor = pos
>
> [EVALUATION:newstest2011]
>
> ### input data
> #
> #input-sgm = "$wmt12-data/$input-extension-test.txt"
> #raw-input = $wmt12-data/$input-extension-test.txt
> tokenized-input = "$wmt12-data/de-test.txt"
> # factorized-input =
> #input = $wmt12-data/$input-extension-test.txt
>
> ### reference data
> #
> #reference-sgm = "$wmt12-data/$output-extension-test.txt"
> #raw-reference ="wmt12-data/$output-extension -test.txt
> tokenized-reference = "$wmt12-data/el-test.txt"
> #reference = $wmt12-data/el-test.txt
>
> ### analysis settings
> # may contain any of the general evaluation analysis settings
> # specific setting: base coverage statistics on earlier run
> #
> #precision-by-coverage-base = $working-dir/evaluation/test.analysis.5
>
> ### wrapping frame
> # for nist-bleu and other scoring scripts, the output needs to be wrapped
> # in sgm markup (typically like the input sgm)
> #
> wrapping-frame = $tokenized-input
>
> ##########################################
> ### REPORTING: summarize evaluation scores
>
> [REPORTING]
>
> ### currently no parameters for reporting section
>
> Thank you,
>
> Dimitris Babaniotis
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
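For reference, a minimal sketch of the two binarizer lines adjusted for a phrase-based setup, assuming the same $moses-bin-dir as in the config above (the CreateOnDisk invocation with its numeric arguments applies only to hierarchical/syntax rule tables):

```ini
# conversion of phrase table into binary on-disk format
# (phrase-based models: use processPhraseTable)
ttable-binarizer = $moses-bin-dir/processPhraseTable

# conversion of rule table into binary on-disk format
# (hierarchical/syntax models only - keep this commented out here)
#ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
```

With a lexicalized reordering model, the reordering table may likewise need the corresponding binarizer (processLexicalTable); the example configurations shipped under $moses-script-dir/ems/example show the usual phrase-based and hierarchical variants.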
