On 28/05/2012 10:01 PM, Philipp Koehn wrote:
> Hi,
>
> there is a problem here:
>
> # conversion of phrase table into binary on-disk format
> #ttable-binarizer = $moses-bin-dir/processPhraseTable
>
> # conversion of rule table into binary on-disk format
> ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
>
> You are using the ttable binarizer for the hierarchical/syntax model,
> but you use a phrase-based model.
>
> -phi
>
> On Sun, May 27, 2012 at 11:45 PM, Dimitris Babaniotis
> <[email protected]> wrote:
>> Hello, I'm trying to run experiments with EMS but the process stops on
>> tuning:tune.
>>
>> Here is the TUNING_tune.stderr file:
>>
>> main::create_extractor_script() called too early to check prototype at
>> /home/dimbaba/moses/scripts/training/mert-moses.pl line 674.
>> Using SCRIPTS_ROOTDIR: /home/dimbaba/moses/scripts
>> Asking moses for feature names and values from
>> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
>> Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config
>> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
>> -inputtype 0 -show-weights > ./features.list
>> MERT starting values and ranges for random generation:
>>   d = 0.600 ( 0.00 .. 1.00)
>>   lm = 0.250 ( 0.00 .. 1.00)
>>   lm = 0.250 ( 0.00 .. 1.00)
>>   w = -1.000 ( 0.00 .. 1.00)
>>   tm = 0.200 ( 0.00 .. 1.00)
>>   tm = 0.200 ( 0.00 .. 1.00)
>>   tm = 0.200 ( 0.00 .. 1.00)
>>   tm = 0.200 ( 0.00 .. 1.00)
>>   tm = 0.200 ( 0.00 .. 1.00)
>> Saved: ./run1.moses.ini
>> Normalizing lambdas: 0.600000 0.250000 0.250000 -1.000000 0.200000
>> 0.200000 0.200000 0.200000 0.200000
>> DECODER_CFG = -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
>> 0.064516 0.064516 0.064516 0.064516
>> Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config
>> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
>> -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
>> 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100
>> -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1
>> > run1.out
>> Translating line 0 in thread id 140471666632448
>> Check (*contextFactor[count-1])[factorType] != NULL failed in
>> moses/src/LM/SRI.cpp:155
>> sh: line 1: 1648 Aborted (core dumped) /home/dimbaba/moses/dist/bin/moses
>> -v 0 -config
>> /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
>> -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516
>> 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100
>> -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1
>> > run1.out
>> Exit code: 134
>> The decoder died.
>> CONFIG WAS -w -0.322581 -lm 0.080645 0.080645 -d 0.193548
>> -tm 0.064516 0.064516 0.064516 0.064516 0.064516
>> cp: cannot stat
>> "/home/dimbaba/mosesFactored/experiment/tuning/tmp.4/moses.ini": No such
>> file or directory
>>
>>
>> ...and this is my configuration file:
>>
>>
>> ################################################
>> ### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###
>> ################################################
>>
>> [GENERAL]
>>
>> ### directory in which experiment is run
>> #
>> working-dir = /home/dimbaba/mosesFactored/experiment
>>
>> # specification of the language pair
>> input-extension = de
>> output-extension = el
>> pair-extension = de-el
>>
>> ### directories that contain tools and data
>> #
>> # moses
>> moses-src-dir = /home/dimbaba/moses
>> #
>> # moses binaries
>> moses-bin-dir = $moses-src-dir/dist/bin
>> #
>> # moses scripts
>> moses-script-dir = $moses-src-dir/scripts
>> #
>> # srilm
>> srilm-dir = /home/dimbaba/srilm/bin/i686-m64
>> #
>> # irstlm
>> #irstlm-dir = $moses-src-dir/irstlm/bin
>> #
>> # randlm
>> #randlm-dir = $moses-src-dir/randlm/bin
>> #
>> # data
>> wmt12-data = /home/dimbaba/aligned/el-de
>>
>> ### basic tools
>> #
>> # moses decoder
>> decoder = $moses-bin-dir/moses
>>
>> # conversion of phrase table into binary on-disk format
>> #ttable-binarizer = $moses-bin-dir/processPhraseTable
>>
>> # conversion of rule table into binary on-disk format
>> ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
>>
>> # tokenizers - comment out if all your data is already tokenized
>> input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension"
>> output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"
>>
>> # truecasers - comment out if you do not use the truecaser
>> input-truecaser = $moses-script-dir/recaser/truecase.perl
>> output-truecaser = $moses-script-dir/recaser/truecase.perl
>> detruecaser = $moses-script-dir/recaser/detruecase.perl
>>
>> ### generic parallelizer for cluster and multi-core machines
>> # you may specify a script that allows the parallel execution of
>> # parallelizable steps (see meta file). you also need to specify
>> # the number of jobs (cluster) or cores (multicore)
>> #
>> #generic-parallelizer = $moses-script-dir/ems/support/generic-parallelizer.perl
>> #generic-parallelizer = $moses-script-dir/ems/support/generic-multicore-parallelizer.perl
>>
>> ### cluster settings (if run on a cluster machine)
>> # number of jobs to be submitted in parallel
>> #
>> #jobs = 10
>>
>> # arguments to qsub when scheduling a job
>> #qsub-settings = ""
>>
>> # project for privileges and usage accounting
>> #qsub-project = iccs_smt
>>
>> # memory and time
>> #qsub-memory = 4
>> #qsub-hours = 48
>>
>> ### multi-core settings
>> # when the generic parallelizer is used, the number of cores is
>> # specified here
>> cores = 4
>>
>> #################################################################
>> # PARALLEL CORPUS PREPARATION:
>> # create a tokenized, sentence-aligned corpus, ready for training
>>
>> [CORPUS]
>>
>> ### long sentences are filtered out, since they slow down GIZA++
>> # and are a less reliable source of data. set here the maximum
>> # length of a sentence
>> #
>> max-sentence-length = 100
>>
>> [CORPUS:europarl] IGNORE
>>
>> ### command to run to get raw corpus files
>> #
>> # get-corpus-script =
>>
>> ### raw corpus files (untokenized, but sentence aligned)
>> #
>> raw-stem = $wmt12-data/training/training.clean10
>>
>> ### tokenized corpus files (may contain long sentences)
>> #
>> #tokenized-stem =
>>
>> ### if sentence filtering should be skipped,
>> # point to the clean training data
>> #
>> #clean-stem =
>>
>> ### if corpus preparation should be skipped,
>> # point to the prepared training data
>> #
>> #lowercased-stem =
>>
>> [CORPUS:nc]
>> raw-stem = $wmt12-data/training/training.clean10
>>
>> [CORPUS:un] IGNORE
>> raw-stem = $wmt12-data/training/training.clean10
>>
>> #################################################################
>> # LANGUAGE MODEL TRAINING
>>
>> [LM]
>>
>> ### tool to be used for language model training
>> # srilm
>> lm-training = $srilm-dir/ngram-count
>> settings = ""
>>
>> # irstlm
>> #lm-training = "$moses-script-dir/generic/trainlm-irst.perl -cores $cores -irst-dir $irstlm-dir -temp-dir $working-dir/lm"
>> #settings = ""
>>
>> # order of the language model
>> order = 3
>>
>> ### tool to be used for training randomized language model from scratch
>> # (more commonly, a SRILM is trained)
>> #
>> #rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>
>> ### script to use for binary table format for irstlm or kenlm
>> # (default: no binarization)
>>
>> # irstlm
>> #lm-binarizer = $irstlm-dir/compile-lm
>>
>> # kenlm, also set type to 8
>> #lm-binarizer = $moses-bin-dir/build_binary
>> #type = 8
>>
>> ### script to create quantized language model format (irstlm)
>> # (default: no quantization)
>> #
>> #lm-quantizer = $irstlm-dir/quantize-lm
>>
>> ### script to use for converting into randomized table format
>> # (default: no randomization)
>> #
>> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>
>> ### each language model to be used has its own section here
>>
>> [LM:europarl] IGNORE
>>
>> ### command to run to get raw corpus files
>> #
>> #get-corpus-script = ""
>>
>> ### raw corpus (untokenized)
>> #
>> raw-corpus = $wmt12-data/training/training.clean.$output-extension
>>
>> ### tokenized corpus files (may contain long sentences)
>> #
>> #tokenized-corpus =
>>
>> ### if corpus preparation should be skipped,
>> # point to the prepared language model
>> #
>> #lm =
>>
>> [LM:nc]
>> raw-corpus = $wmt12-data/training/training.clean10.$output-extension
>>
>> [LM:un] IGNORE
>> raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension
>>
>> [LM:news] IGNORE
>> raw-corpus = $wmt12-data/training/news.$output-extension.shuffled
>>
>> [LM:nc=stem]
>> factors = "stem"
>> order = 3
>> settings = ""
>> raw-corpus = $wmt12-data/training/training.clean.$output-extension
>>
>> #################################################################
>> # INTERPOLATING LANGUAGE MODELS
>>
>> [INTERPOLATED-LM] IGNORE
>>
>> # if multiple language models are used, these may be combined
>> # by optimizing perplexity on a tuning set
>> # see, for instance, [Koehn and Schwenk, IJCNLP 2008]
>>
>> ### script to interpolate language models
>> # if commented out, no interpolation is performed
>> #
>> script = $moses-script-dir/ems/support/interpolate-lm.perl
>>
>> ### tuning set
>> # you may use the same set that is used for mert tuning (reference set)
>> #
>> tuning-sgm = $wmt12-data/dev/newstest2010-ref.$output-extension.sgm
>> #raw-tuning =
>> #tokenized-tuning =
>> #factored-tuning =
>> #lowercased-tuning =
>> #split-tuning =
>>
>> ### group language models for hierarchical interpolation
>> # (flat interpolation is limited to 10 language models)
>> #group = "first,second fourth,fifth"
>>
>> ### script to use for binary table format for irstlm or kenlm
>> # (default: no binarization)
>>
>> # irstlm
>> #lm-binarizer = $irstlm-dir/compile-lm
>>
>> # kenlm, also set type to 8
>> #lm-binarizer = $moses-bin-dir/build_binary
>> #type = 8
>>
>> ### script to create quantized language model format (irstlm)
>> # (default: no quantization)
>> #
>> #lm-quantizer = $irstlm-dir/quantize-lm
>>
>> ### script to use for converting into randomized table format
>> # (default: no randomization)
>> #
>> #lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"
>>
>> #################################################################
>> # FACTOR DEFINITION
>>
>> [INPUT-FACTOR]
>>
>> # also used for output factors
>> temp-dir = $working-dir/training/factor
>>
>> [INPUT-FACTOR:stem]
>>
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>> ### script that generates this factor
>> #
>> #mxpost = /home/pkoehn/bin/mxpost
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>>
>> [OUTPUT-FACTOR:stem]
>>
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>> ### script that generates this factor
>> #
>> #mxpost = /home/pkoehn/bin/mxpost
>> factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"
>>
>> #################################################################
>> # TRANSLATION MODEL TRAINING
>>
>> [TRAINING]
>>
>> ### training script to be used: either a legacy script or
>> # current moses training script (default)
>> #
>> script = $moses-script-dir/training/train-model.perl
>>
>> ### general options
>> # these are options that are passed on to train-model.perl, for instance
>> # * "-mgiza -mgiza-cpus 8" to use mgiza instead of giza
>> # * "-sort-buffer-size 8G" to reduce on-disk sorting
>> #
>> #training-options = ""
>>
>> ### factored training: specify here which factors are used
>> # if none specified, single-factor training is assumed
>> # (one translation step, surface to surface)
>> #
>> input-factors = word stem
>> output-factors = word stem
>> alignment-factors = "stem -> stem"
>> translation-factors = "word -> word"
>> reordering-factors = "word -> word"
>> #generation-factors =
>> decoding-steps = "t0"
>>
>> ### parallelization of data preparation step
>> # the two directions of the data preparation can be run in parallel
>> # comment out if not needed
>> #
>> parallel = yes
>>
>> ### pre-computation for giza++
>> # giza++ has a more efficient data structure that needs to be
>> # initialized with snt2cooc. if run in parallel, this may reduce
>> # memory requirements. set here the number of parts
>> #
>> #run-giza-in-parts = 5
>>
>> ### symmetrization method to obtain word alignments from giza output
>> # (commonly used: grow-diag-final-and)
>> #
>> alignment-symmetrization-method = grow-diag-final-and
>>
>> ### use of berkeley aligner for word alignment
>> #
>> #use-berkeley = true
>> #alignment-symmetrization-method = berkeley
>> #berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
>> #berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
>> #berkeley-jar = /your/path/to/berkeleyaligner-1.1/berkeleyaligner.jar
>> #berkeley-java-options = "-server -mx30000m -ea"
>> #berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
>> #berkeley-process-options = "-EMWordAligner.numThreads 8"
>> #berkeley-posterior = 0.5
>>
>> ### if word alignment should be skipped,
>> # point to word alignment files
>> #
>> #word-alignment = $working-dir/model/aligned.1
>>
>> ### create a bilingual concordancer for the model
>> #
>> #biconcor = $moses-script-dir/ems/biconcor/biconcor
>>
>> ### lexicalized reordering: specify orientation type
>> # (default: only distance-based reordering model)
>> #
>> lexicalized-reordering = msd-bidirectional-fe
>>
>> ### hierarchical rule set
>> #
>> hierarchical-rule-set = true
>>
>> ### settings for rule extraction
>> #
>> #extract-settings = ""
>>
>> ### unknown word labels (target syntax only)
>> # enables use of unknown word labels during decoding
>> # label file is generated during rule extraction
>> #
>> #use-unknown-word-labels = true
>>
>> ### if phrase extraction should be skipped,
>> # point to stem for extract files
>> #
>> # extracted-phrases =
>>
>> ### settings for rule scoring
>> #
>> score-settings = "--GoodTuring"
>>
>> ### include word alignment in phrase table
>> #
>> #include-word-alignment-in-rules = yes
>>
>> ### if phrase table training should be skipped,
>> # point to phrase translation table
>> #
>> # phrase-translation-table =
>>
>> ### if reordering table training should be skipped,
>> # point to reordering table
>> #
>> # reordering-table =
>>
>> ### if training should be skipped,
>> # point to a configuration file that contains
>> # pointers to all relevant model files
>> #
>> #config-with-reused-weights =
>>
>> #####################################################
>> ### TUNING: finding good weights for model components
>>
>> [TUNING]
>>
>> ### instead of tuning with this setting, old weights may be recycled
>> # specify here an old configuration file with matching weights
>> #
>> #weight-config = $working-dir/tuning/moses.filtered.ini.1
>>
>> ### tuning script to be used
>> #
>> tuning-script = $moses-script-dir/training/mert-moses.pl
>> tuning-settings = "-mertdir $moses-bin-dir --filtercmd '$moses-script-dir/training/filter-model-given-input.pl'"
>>
>> ### specify the corpus used for tuning
>> # it should contain 1000s of sentences
>> #
>> #input-sgm =
>> raw-input = $wmt12-data/tuning/tuning.clean.$input-extension
>> #tokenized-input =
>> #factorized-input =
>> #input =
>> #
>> #reference-sgm =
>> raw-reference = $wmt12-data/tuning/tuning.clean.$output-extension
>> #tokenized-reference =
>> #factorized-reference =
>> #reference =
>>
>> ### size of n-best list used (typically 100)
>> #
>> nbest = 100
>>
>> ### ranges for weights for random initialization
>> # if not specified, the tuning script will use generic ranges
>> # it is not clear if this matters
>> #
>> # lambda =
>>
>> ### additional flags for the filter script
>> #
>> #filter-settings = "-Binarizer CreateOnDiskPt 1 1 5 100 2 -Hierarchical"
>>
>> ### additional flags for the decoder
>> #
>> decoder-settings = ""
>>
>> ### if tuning should be skipped, specify this here
>> # and also point to a configuration file that contains
>> # pointers to all relevant model files
>> #
>> #config =
>>
>> #########################################################
>> ## RECASER: restore case, this part only trains the model
>>
>> [RECASING]
>>
>> #decoder = $moses-bin-dir/moses
>>
>> ### training data
>> # raw input still needs to be tokenized,
>> # also tokenized input may be specified
>> #
>> #tokenized = [LM:europarl:tokenized-corpus]
>>
>> # recase-config =
>>
>> #lm-training = $srilm-dir/ngram-count
>>
>> #######################################################
>> ## TRUECASER: train model to truecase corpora and input
>>
>> [TRUECASER]
>>
>> ### script to train truecaser models
>> #
>> trainer = $moses-script-dir/recaser/train-truecaser.perl
>>
>> ### training data
>> # data on which truecaser is trained
>> # if no training data is specified, parallel corpus is used
>> #
>> # raw-stem =
>> # tokenized-stem =
>>
>> ### trained model
>> #
>> # truecase-model =
>>
>> ######################################################################
>> ## EVALUATION: translating a test set using the tuned system and scoring it
>>
>> [EVALUATION]
>>
>> ### number of jobs (if parallel execution on cluster)
>> #
>> #jobs = 10
>>
>> ### additional flags for the filter script
>> #
>> #filter-settings = ""
>>
>> ### additional decoder settings
>> # switches for the Moses decoder
>> # common choices:
>> #   "-threads N" for multi-threading
>> #   "-mbr" for MBR decoding
>> #   "-drop-unknown" for dropping unknown source words
>> #   "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000" for cube pruning
>> #
>> decoder-settings = "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000"
>>
>> ### specify size of n-best list, if produced
>> #
>> #nbest = 100
>>
>> ### multiple reference translations
>> #
>> #multiref = yes
>>
>> ### prepare system output for scoring
>> # this may include detokenization and wrapping output in sgm
>> # (needed for nist-bleu, ter, meteor)
>> #
>> detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"
>> #recaser = $moses-script-dir/recaser/recase.perl
>> wrapping-script = "$moses-script-dir/ems/support/wrap-xml.perl $output-extension"
>> #output-sgm =
>>
>> ### BLEU
>> #
>> nist-bleu = $moses-script-dir/generic/mteval-v13a.pl
>> nist-bleu-c = "$moses-script-dir/generic/mteval-v13a.pl -c"
>> #multi-bleu = $moses-script-dir/generic/multi-bleu.perl
>> #ibm-bleu =
>>
>> ### TER: translation error rate (BBN metric) based on edit distance
>> # not yet integrated
>> #
>> # ter =
>>
>> ### METEOR: gives credit to stem / WordNet synonym matches
>> # not yet integrated
>> #
>> # meteor =
>>
>> ### Analysis: carry out various forms of analysis on the output
>> #
>> analysis = $moses-script-dir/ems/support/analysis.perl
>> #
>> # also report on input coverage
>> analyze-coverage = yes
>> #
>> # also report on phrase mappings used
>> report-segmentation = yes
>> #
>> # report precision of translations for each input word, broken down by
>> # count of input word in corpus and model
>> #report-precision-by-coverage = yes
>> #
>> # further precision breakdown by factor
>> #precision-by-coverage-factor = pos
>>
>> [EVALUATION:newstest2011]
>>
>> ### input data
>> #
>> #input-sgm = "$wmt12-data/$input-extension-test.txt"
>> #raw-input = $wmt12-data/$input-extension-test.txt
>> tokenized-input = "$wmt12-data/de-test.txt"
>> # factorized-input =
>> #input = $wmt12-data/$input-extension-test.txt
>>
>> ### reference data
>> #
>> #reference-sgm = "$wmt12-data/$output-extension-test.txt"
>> #raw-reference = "$wmt12-data/$output-extension-test.txt"
>> tokenized-reference = "$wmt12-data/el-test.txt"
>> #reference = $wmt12-data/el-test.txt
>>
>> ### analysis settings
>> # may contain any of the general evaluation analysis settings
>> # specific setting: base coverage statistics on earlier run
>> #
>> #precision-by-coverage-base = $working-dir/evaluation/test.analysis.5
>>
>> ### wrapping frame
>> # for nist-bleu and other scoring scripts, the output needs to be wrapped
>> # in sgm markup (typically like the input sgm)
>> #
>> wrapping-frame = $tokenized-input
>>
>> ##########################################
>> ### REPORTING: summarize evaluation scores
>>
>> [REPORTING]
>>
>> ### currently no parameters for reporting section
>>
>> Thank you,
>>
>> Dimitris Babaniotis
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support

Hi,

thank you for your answer,
I fixed the problem that you mentioned, but the error still occurs. Looking into it further, I found that the error happens when the decoder tries to translate a sentence. The problem appears with or without EMS.

Dimitris
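[Editor's note: for a phrase-based system, the fix Philipp points out amounts to swapping which `ttable-binarizer` line is active in the `[GENERAL]` section of the EMS config quoted above. A minimal sketch of the corrected fragment, using the same paths as in that config:]

```ini
# conversion of phrase table into binary on-disk format
# (this is the binarizer for phrase-based models -- enable it)
ttable-binarizer = $moses-bin-dir/processPhraseTable

# conversion of rule table into binary on-disk format
# (only for hierarchical/syntax models -- comment it out here)
#ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"
```

Whether other hierarchical settings in the config (such as `hierarchical-rule-set = true` under `[TRAINING]`) also need revisiting depends on which model type is actually intended; the thread does not settle that.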
