Hi, there is a problem here:
# conversion of phrase table into binary on-disk format
#ttable-binarizer = $moses-bin-dir/processPhraseTable

# conversion of rule table into binary on-disk format
ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"

You are using the ttable binarizer for the hierarchical/syntax model, but your model is phrase-based: comment out the CreateOnDisk line and use processPhraseTable instead.

-phi

On Sun, May 27, 2012 at 11:45 PM, Dimitris Babaniotis <[email protected]> wrote:

> Hello, I'm trying to run experiments with EMS, but the process stops on
> tuning:tune.
>
> Here is the TUNING_tune.stderr file:
>
> main::create_extractor_script() called too early to check prototype at
> /home/dimbaba/moses/scripts/training/mert-moses.pl line 674.
> Using SCRIPTS_ROOTDIR: /home/dimbaba/moses/scripts
> Asking moses for feature names and values from
> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
> Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config
> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
> -inputtype 0 -show-weights > ./features.list
> MERT starting values and ranges for random generation:
>   d = 0.600 ( 0.00 .. 1.00)
>   lm = 0.250 ( 0.00 .. 1.00)
>   lm = 0.250 ( 0.00 .. 1.00)
>   w = -1.000 ( 0.00 .. 1.00)
>   tm = 0.200 ( 0.00 .. 1.00)
>   tm = 0.200 ( 0.00 .. 1.00)
>   tm = 0.200 ( 0.00 .. 1.00)
>   tm = 0.200 ( 0.00 .. 1.00)
>   tm = 0.200 ( 0.00 .. 1.00)
> Saved: ./run1.moses.ini
> Normalizing lambdas: 0.600000 0.250000 0.250000 -1.000000 0.200000 0.200000 0.200000 0.200000 0.200000
> DECODER_CFG = -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
> 0.064516 0.064516 0.064516 0.064516
> Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config
> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
> -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
> 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100
> -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1 > run1.out
> Translating line 0 in thread id 140471666632448
> Check (*contextFactor[count-1])[factorType] != NULL failed in
> moses/src/LM/SRI.cpp:155
> sh: line 1: 1648 Aborted (core dumped) /home/dimbaba/moses/dist/bin/moses
> -v 0 -config /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
> -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
> 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100
> -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1 > run1.out
> Exit code: 134
> The decoder died.
> CONFIG WAS -w -0.322581 -lm 0.080645 0.080645 -d 0.193548
> -tm 0.064516 0.064516 0.064516 0.064516 0.064516
> cp: cannot stat
> '/home/dimbaba/mosesFactored/experiment/tuning/tmp.4/moses.ini': No such
> file or directory
>
>
> ...and this is my configuration file:
>
>
> ################################################
> ### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###
> ################################################
>
> [GENERAL]
>
> ### directory in which experiment is run
> #
> working-dir = /home/dimbaba/mosesFactored/experiment
>
> # specification of the language pair
> input-extension = de
> output-extension = el
> pair-extension = de-el
>
> ### directories that contain tools and data
> #
> # moses
> moses-src-dir = /home/dimbaba/moses
> #
> # moses binaries
> moses-bin-dir = $moses-src-dir/dist/bin
> #
> # moses scripts
> moses-script-dir = $moses-src-dir/scripts
> #
> # srilm
> srilm-dir = /home/dimbaba/srilm/bin/i686-m64
> #
> # irstlm
> #irstlm-dir = $moses-src-dir/irstlm/bin
> #
> # randlm
> #randlm-dir = $moses-src-dir/randlm/bin
> #
> # data
> wmt12-data = /home/dimbaba/aligned/el-de
>
> ### basic tools
> #
> # moses decoder
> decoder = $moses-bin-dir/moses
>
> # conversion of phrase table into binary on-disk format
> #ttable-binarizer = $moses-bin-dir/processPhraseTable
>
> # conversion of rule table into binary on-disk format
> ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
>
> # tokenizers - comment out if all your data is already tokenized
> input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension"
> output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"
>
> # truecasers - comment out if you do not use the truecaser
> input-truecaser = $moses-script-dir/recaser/truecase.perl
> output-truecaser = $moses-script-dir/recaser/truecase.perl
> detruecaser = $moses-script-dir/recaser/detruecase.perl
>
> ### generic parallelizer for cluster and multi-core machines
> # you may specify a script that allows the parallel execution of
> # parallelizable steps (see meta file). you also need to specify
> # the number of jobs (cluster) or cores (multicore)
> #
> #generic-parallelizer = $moses-script-dir/ems/support/generic-parallelizer.perl
> #generic-parallelizer = $moses-script-dir/ems/support/generic-multicore-parallelizer.perl
>
> ### cluster settings (if run on a cluster machine)
> # number of jobs to be submitted in parallel
> #
> #jobs = 10
>
> # arguments to qsub when scheduling a job
> #qsub-settings = ""
>
> # project for privileges and usage accounting
> #qsub-project = iccs_smt
>
> # memory and time
> #qsub-memory = 4
> #qsub-hours = 48
>
> ### multi-core settings
> # when the generic parallelizer is used, the number of cores is
> # specified here
> cores = 4
>
> #################################################################
> # PARALLEL CORPUS PREPARATION:
> # create a tokenized, sentence-aligned corpus, ready for training
>
> [CORPUS]
>
> ### long sentences are filtered out, since they slow down GIZA++
> # and are a less reliable source of data.
> # set here the maximum
> # length of a sentence
> #
> max-sentence-length = 100
>
> [CORPUS:europarl] IGNORE
>
> ### command to run to get raw corpus files
> #
> # get-corpus-script =
>
> ### raw corpus files (untokenized, but sentence aligned)
> #
> raw-stem = $wmt12-data/training/training.clean10
>
> ### tokenized corpus files (may contain long sentences)
> #
> #tokenized-stem =
>
> ### if sentence filtering should be skipped,
> # point to the clean training data
> #
> #clean-stem =
>
> ### if corpus preparation should be skipped,
> # point to the prepared training data
> #
> #lowercased-stem =
>
> [CORPUS:nc]
> raw-stem = $wmt12-data/training/training.clean10
>
> [CORPUS:un] IGNORE
> raw-stem = $wmt12-data/training/training.clean10
>
> #################################################################
> # LANGUAGE MODEL TRAINING
>
> [LM]
>
> ### tool to be used for language model training
> # srilm
> lm-training = $srilm-dir/ngram-count
> settings = ""
>
> # irstlm
> #lm-training = "$moses-script-dir/generic/trainlm-irst.perl -cores $cores -irst-dir $irstlm-dir -temp-dir $working-dir/lm"
> #settings = ""
>
> # order of the language model
> order = 3
>
> ### tool to be used for training randomized language model from scratch
> # (more commonly, a SRILM is trained)
> #
> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> ### script to use for binary table format for irstlm or kenlm
> # (default: no binarization)
>
> # irstlm
> #lm-binarizer = $irstlm-dir/compile-lm
>
> # kenlm, also set type to 8
> #lm-binarizer = $moses-bin-dir/build_binary
> #type = 8
>
> ### script to create quantized language model format (irstlm)
> # (default: no quantization)
> #
> #lm-quantizer = $irstlm-dir/quantize-lm
>
> ### script to use for converting into randomized table format
> # (default: no randomization)
> #
> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> ### each language model to be used has its own section here
>
> [LM:europarl] IGNORE
>
> ### command to run to get raw corpus files
> #
> #get-corpus-script = ""
>
> ### raw corpus (untokenized)
> #
> raw-corpus = $wmt12-data/training/training.clean.$output-extension
>
> ### tokenized corpus files (may contain long sentences)
> #
> #tokenized-corpus =
>
> ### if corpus preparation should be skipped,
> # point to the prepared language model
> #
> #lm =
>
> [LM:nc]
> raw-corpus = $wmt12-data/training/training.clean10.$output-extension
>
> [LM:un] IGNORE
> raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension
>
> [LM:news] IGNORE
> raw-corpus = $wmt12-data/training/news.$output-extension.shuffled
>
> [LM:nc=stem]
> factors = "stem"
> order = 3
> settings = ""
> raw-corpus = $wmt12-data/training/training.clean.$output-extension
>
> #################################################################
> # INTERPOLATING LANGUAGE MODELS
>
> [INTERPOLATED-LM] IGNORE
>
> # if multiple language models are used, these may be combined
> # by optimizing perplexity on a tuning set
> # see, for instance, [Koehn and Schwenk, IJCNLP 2008]
>
> ### script to interpolate language models
> # if commented out, no interpolation is performed
> #
> script = $moses-script-dir/ems/support/interpolate-lm.perl
>
> ### tuning set
> # you may use the same set that is used for mert tuning (reference set)
> #
> tuning-sgm = $wmt12-data/dev/newstest2010-ref.$output-extension.sgm
> #raw-tuning =
> #tokenized-tuning =
> #factored-tuning =
> #lowercased-tuning =
> #split-tuning =
>
> ### group language models for hierarchical interpolation
> # (flat interpolation is limited to 10 language models)
> #group = "first,second fourth,fifth"
>
> ### script to use for binary table format for irstlm or kenlm
> # (default: no binarization)
>
> # irstlm
> #lm-binarizer = $irstlm-dir/compile-lm
>
> # kenlm, also set type to 8
> #lm-binarizer = $moses-bin-dir/build_binary
> #type = 8
>
> ### script to create quantized language model format (irstlm)
> # (default: no quantization)
> #
> #lm-quantizer = $irstlm-dir/quantize-lm
>
> ### script to use for converting into randomized table format
> # (default: no randomization)
> #
> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>
> #################################################################
> # FACTOR DEFINITION
>
> [INPUT-FACTOR]
>
> # also used for output factors
> temp-dir = $working-dir/training/factor
>
> [INPUT-FACTOR:stem]
>
> ### script that generates this factor
> #
> #mxpost = /home/pkoehn/bin/mxpost
> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>
> [OUTPUT-FACTOR:stem]
>
> ### script that generates this factor
> #
> #mxpost = /home/pkoehn/bin/mxpost
> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>
> #################################################################
> # TRANSLATION MODEL TRAINING
>
> [TRAINING]
>
> ### training script to be used: either a legacy script or
> # the current moses training script (default)
> #
> script = $moses-script-dir/training/train-model.perl
>
> ### general options
> # these are options that are passed on to train-model.perl, for instance
> # * "-mgiza -mgiza-cpus 8" to use mgiza instead of giza
> # * "-sort-buffer-size 8G" to reduce on-disk sorting
> #
> #training-options = ""
>
> ### factored training: specify here which factors are used
> # if none specified, single-factor training is assumed
> # (one translation step, surface to surface)
> #
> input-factors = word stem
> output-factors = word stem
> alignment-factors = "stem -> stem"
> translation-factors = "word -> word"
> reordering-factors = "word -> word"
> #generation-factors =
> decoding-steps = "t0"
>
> ### parallelization of data preparation step
> # the two directions of the data preparation can be run in parallel
> # comment out if not needed
> #
> parallel = yes
>
> ### pre-computation for giza++
> # giza++ has a more efficient data structure that needs to be
> # initialized with snt2cooc. if run in parallel, this may reduce
> # memory requirements. set here the number of parts
> #
> #run-giza-in-parts = 5
>
> ### symmetrization method to obtain word alignments from giza output
> # (commonly used: grow-diag-final-and)
> #
> alignment-symmetrization-method = grow-diag-final-and
>
> ### use of berkeley aligner for word alignment
> #
> #use-berkeley = true
> #alignment-symmetrization-method = berkeley
> #berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
> #berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
> #berkeley-jar = /your/path/to/berkeleyaligner-1.1/berkeleyaligner.jar
> #berkeley-java-options = "-server -mx30000m -ea"
> #berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
> #berkeley-process-options = "-EMWordAligner.numThreads 8"
> #berkeley-posterior = 0.5
>
> ### if word alignment should be skipped,
> # point to word alignment files
> #
> #word-alignment = $working-dir/model/aligned.1
>
> ### create a bilingual concordancer for the model
> #
> #biconcor = $moses-script-dir/ems/biconcor/biconcor
>
> ### lexicalized reordering: specify orientation type
> # (default: only distance-based reordering model)
> #
> lexicalized-reordering = msd-bidirectional-fe
>
> ### hierarchical rule set
> #
> hierarchical-rule-set = true
>
> ### settings for rule extraction
> #
> #extract-settings = ""
>
> ### unknown word labels (target syntax only)
> # enables use of unknown word labels during decoding
> # label file is generated during rule extraction
> #
> #use-unknown-word-labels = true
>
> ### if phrase extraction should be skipped,
> # point to stem for extract files
> #
> # extracted-phrases =
>
> ### settings for rule scoring
> #
> score-settings = "--GoodTuring"
>
> ### include word alignment in phrase table
> #
> #include-word-alignment-in-rules = yes
>
> ### if phrase table training should be skipped,
> # point to phrase translation table
> #
> # phrase-translation-table =
>
> ### if reordering table training should be skipped,
> # point to reordering table
> #
> # reordering-table =
>
> ### if training should be skipped,
> # point to a configuration file that contains
> # pointers to all relevant model files
> #
> #config-with-reused-weights =
>
> #####################################################
> ### TUNING: finding good weights for model components
>
> [TUNING]
>
> ### instead of tuning with this setting, old weights may be recycled
> # specify here an old configuration file with matching weights
> #
> #weight-config = $working-dir/tuning/moses.filtered.ini.1
>
> ### tuning script to be used
> #
> tuning-script = $moses-script-dir/training/mert-moses.pl
> tuning-settings = "-mertdir $moses-bin-dir --filtercmd '$moses-script-dir/training/filter-model-given-input.pl'"
>
> ### specify the corpus used for tuning
> # it should contain 1000s of sentences
> #
> #input-sgm =
> raw-input = $wmt12-data/tuning/tuning.clean.$input-extension
> #tokenized-input =
> #factorized-input =
> #input =
> #
> #reference-sgm =
> raw-reference = $wmt12-data/tuning/tuning.clean.$output-extension
> #tokenized-reference =
> #factorized-reference =
> #reference =
>
> ### size of n-best list used (typically 100)
> #
> nbest = 100
>
> ### ranges for weights for random initialization
> # if not specified, the tuning script will use generic ranges
> # it is not clear if this matters
> #
> # lambda =
>
> ### additional flags for the filter script
> #
> #filter-settings = "-Binarizer CreateOnDiskPt 1 1 5 100 2 -Hierarchical"
>
> ### additional flags for the decoder
> #
> decoder-settings = ""
>
> ### if tuning should be skipped, specify this here
> # and also point to a configuration file that contains
> # pointers to all relevant model files
> #
> #config =
>
> #########################################################
> ## RECASER: restore case, this part only trains the model
>
> [RECASING]
>
> #decoder = $moses-bin-dir/moses
>
> ### training data
> # raw input still needs to be tokenized;
> # alternatively, tokenized input may be specified
> #
> #tokenized = [LM:europarl:tokenized-corpus]
>
> # recase-config =
>
> #lm-training = $srilm-dir/ngram-count
>
> #######################################################
> ## TRUECASER: train model to truecase corpora and input
>
> [TRUECASER]
>
> ### script to train truecaser models
> #
> trainer = $moses-script-dir/recaser/train-truecaser.perl
>
> ### training data
> # data on which truecaser is trained
> # if no training data is specified, parallel corpus is used
> #
> # raw-stem =
> # tokenized-stem =
>
> ### trained model
> #
> # truecase-model =
>
> ######################################################################
> ## EVALUATION: translating a test set using the tuned system and scoring it
>
> [EVALUATION]
>
> ### number of jobs (if parallel execution on cluster)
> #
> #jobs = 10
>
> ### additional flags for the filter script
> #
> #filter-settings = ""
>
> ### additional decoder settings
> # switches for the Moses decoder
> # common choices:
> # "-threads N" for multi-threading
> # "-mbr" for MBR decoding
> # "-drop-unknown" for dropping unknown source words
> # "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000" for cube pruning
> #
> decoder-settings = "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000"
>
> ### specify size of n-best list, if produced
> #
> #nbest = 100
>
> ### multiple reference translations
> #
> #multiref = yes
>
> ### prepare system output for scoring
> # this may include detokenization and wrapping output in sgm
> # (needed for nist-bleu, ter, meteor)
> #
> detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"
> #recaser = $moses-script-dir/recaser/recase.perl
> wrapping-script = "$moses-script-dir/ems/support/wrap-xml.perl $output-extension"
> #output-sgm =
>
> ### BLEU
> #
> nist-bleu = $moses-script-dir/generic/mteval-v13a.pl
> nist-bleu-c = "$moses-script-dir/generic/mteval-v13a.pl -c"
> #multi-bleu = $moses-script-dir/generic/multi-bleu.perl
> #ibm-bleu =
>
> ### TER: translation error rate (BBN metric) based on edit distance
> # not yet integrated
> #
> # ter =
>
> ### METEOR: gives credit to stem / WordNet synonym matches
> # not yet integrated
> #
> # meteor =
>
> ### Analysis: carry out various forms of analysis on the output
> #
> analysis = $moses-script-dir/ems/support/analysis.perl
> #
> # also report on input coverage
> analyze-coverage = yes
> #
> # also report on phrase mappings used
> report-segmentation = yes
> #
> # report precision of translations for each input word, broken down by
> # count of input word in corpus and model
> #report-precision-by-coverage = yes
> #
> # further precision breakdown by factor
> #precision-by-coverage-factor = pos
>
> [EVALUATION:newstest2011]
>
> ### input data
> #
> #input-sgm = "$wmt12-data/$input-extension-test.txt"
> #raw-input = $wmt12-data/$input-extension-test.txt
> tokenized-input = "$wmt12-data/de-test.txt"
> # factorized-input =
> #input = $wmt12-data/$input-extension-test.txt
>
> ### reference data
> #
> #reference-sgm = "$wmt12-data/$output-extension-test.txt"
> #raw-reference ="wmt12-data/$output-extension -test.txt
> tokenized-reference = "$wmt12-data/el-test.txt"
> #reference = $wmt12-data/el-test.txt
>
> ### analysis settings
> # may contain any of the general evaluation analysis settings
> # specific setting: base coverage statistics on earlier run
> #
> #precision-by-coverage-base = $working-dir/evaluation/test.analysis.5
>
> ### wrapping frame
> # for nist-bleu and other scoring scripts, the output needs to be wrapped
> # in sgm markup (typically like the input sgm)
> #
> wrapping-frame = $tokenized-input
>
> ##########################################
> ### REPORTING: summarize evaluation scores
>
> [REPORTING]
>
> ### currently no parameters for reporting section
>
> Thank you,
>
> Dimitris Babaniotis
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
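For reference, a minimal sketch of the two binarizer lines adjusted for a phrase-based setup, assuming the same $moses-bin-dir as in the config above (the CreateOnDisk invocation with its numeric arguments applies only to hierarchical/syntax rule tables):

```ini
# conversion of phrase table into binary on-disk format
# (phrase-based models: use processPhraseTable)
ttable-binarizer = $moses-bin-dir/processPhraseTable

# conversion of rule table into binary on-disk format
# (hierarchical/syntax models only - keep this commented out here)
#ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
```

With a lexicalized reordering model, the reordering table may likewise need the corresponding binarizer (processLexicalTable); the example configurations shipped under $moses-script-dir/ems/example show the usual phrase-based and hierarchical variants.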
