Hello, I'm trying to run experiments with EMS, but the process stops at the TUNING:tune step.
Here is the TUNING_tune.stderr file:

main::create_extractor_script() called too early to check prototype at /home/dimbaba/moses/scripts/training/mert-moses.pl line 674.
Using SCRIPTS_ROOTDIR: /home/dimbaba/moses/scripts
Asking moses for feature names and values from /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4
Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4 -inputtype 0 -show-weights > ./features.list
MERT starting values and ranges for random generation:
  d = 0.600 ( 0.00 .. 1.00)
  lm = 0.250 ( 0.00 .. 1.00)
  lm = 0.250 ( 0.00 .. 1.00)
  w = -1.000 ( 0.00 .. 1.00)
  tm = 0.200 ( 0.00 .. 1.00)
  tm = 0.200 ( 0.00 .. 1.00)
  tm = 0.200 ( 0.00 .. 1.00)
  tm = 0.200 ( 0.00 .. 1.00)
  tm = 0.200 ( 0.00 .. 1.00)
Saved: ./run1.moses.ini
Normalizing lambdas: 0.600000 0.250000 0.250000 -1.000000 0.200000 0.200000 0.200000 0.200000 0.200000
DECODER_CFG = -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516 0.064516 0.064516 0.064516 0.064516
Executing: /home/dimbaba/moses/dist/bin/moses -v 0 -config /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4 -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100 -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1 > run1.out
Translating line 0 in thread id 140471666632448
Check (*contextFactor[count-1])[factorType] != NULL failed in moses/src/LM/SRI.cpp:155
sh: line 1: 1648 Aborted (core dumped) /home/dimbaba/moses/dist/bin/moses -v 0 -config /home/dimbaba/mosesFactored/experiment/tuning/moses.filtered.ini.4 -inputtype 0 -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516 0.064516 0.064516 0.064516 0.064516 -n-best-list run1.best100.out 100 -input-file /home/dimbaba/mosesFactored/experiment/tuning/input.tc.1 > run1.out
Exit code: 134
The decoder died.
CONFIG WAS -w -0.322581 -lm 0.080645 0.080645 -d 0.193548 -tm 0.064516 0.064516 0.064516 0.064516 0.064516
cp: cannot stat «/home/dimbaba/mosesFactored/experiment/tuning/tmp.4/moses.ini»: No such file or directory

...and this is my configuration file:

################################################
### CONFIGURATION FILE FOR AN SMT EXPERIMENT ###
################################################

[GENERAL]

### directory in which experiment is run
#
working-dir = /home/dimbaba/mosesFactored/experiment

# specification of the language pair
input-extension = de
output-extension = el
pair-extension = de-el

### directories that contain tools and data
#
# moses
moses-src-dir = /home/dimbaba/moses
#
# moses binaries
moses-bin-dir = $moses-src-dir/dist/bin
#
# moses scripts
moses-script-dir = $moses-src-dir/scripts
#
# srilm
srilm-dir = /home/dimbaba/srilm/bin/i686-m64
#
# irstlm
#irstlm-dir = $moses-src-dir/irstlm/bin
#
# randlm
#randlm-dir = $moses-src-dir/randlm/bin
#
# data
wmt12-data = /home/dimbaba/aligned/el-de

### basic tools
#
# moses decoder
decoder = $moses-bin-dir/moses

# conversion of phrase table into binary on-disk format
#ttable-binarizer = $moses-bin-dir/processPhraseTable

# conversion of rule table into binary on-disk format
ttable-binarizer = "$moses-bin-dir/CreateOnDisk 1 1 5 100 2"

# tokenizers - comment out if all your data is already tokenized
input-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension"
output-tokenizer = "$moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension"

# truecasers - comment out if you do not use the truecaser
input-truecaser = $moses-script-dir/recaser/truecase.perl
output-truecaser = $moses-script-dir/recaser/truecase.perl
detruecaser = $moses-script-dir/recaser/detruecase.perl

### generic parallelizer for cluster and multi-core machines
# you may specify a script that allows the parallel execution
# of parallizable steps (see meta file).
# You also need to specify
# the number of jobs (cluster) or cores (multicore)
#
#generic-parallelizer = $moses-script-dir/ems/support/generic-parallelizer.perl
#generic-parallelizer = $moses-script-dir/ems/support/generic-multicore-parallelizer.perl

### cluster settings (if run on a cluster machine)
# number of jobs to be submitted in parallel
#
#jobs = 10

# arguments to qsub when scheduling a job
#qsub-settings = ""

# project for privileges and usage accounting
#qsub-project = iccs_smt

# memory and time
#qsub-memory = 4
#qsub-hours = 48

### multi-core settings
# when the generic parallelizer is used, the number of cores
# specified here
cores = 4

#################################################################
# PARALLEL CORPUS PREPARATION:
# create a tokenized, sentence-aligned corpus, ready for training

[CORPUS]

### long sentences are filtered out, since they slow down GIZA++
# and are a less reliable source of data. set here the maximum
# length of a sentence
#
max-sentence-length = 100

[CORPUS:europarl] IGNORE

### command to run to get raw corpus files
#
# get-corpus-script =

### raw corpus files (untokenized, but sentence aligned)
#
raw-stem = $wmt12-data/training/training.clean10

### tokenized corpus files (may contain long sentences)
#
#tokenized-stem =

### if sentence filtering should be skipped,
# point to the clean training data
#
#clean-stem =

### if corpus preparation should be skipped,
# point to the prepared training data
#
#lowercased-stem =

[CORPUS:nc]
raw-stem = $wmt12-data/training/training.clean10

[CORPUS:un] IGNORE
raw-stem = $wmt12-data/training/training.clean10

#################################################################
# LANGUAGE MODEL TRAINING

[LM]

### tool to be used for language model training
# srilm
lm-training = $srilm-dir/ngram-count
settings = ""

# irstlm
#lm-training = "$moses-script-dir/generic/trainlm-irst.perl -cores $cores -irst-dir $irstlm-dir -temp-dir $working-dir/lm"
#settings = ""

# order of the language model
order = 3
### tool to be used for training randomized language model from scratch
# (more commonly, a SRILM is trained)
#
#rlm-training = "$randlm-dir/buildlm -falsepos 8 -values 8"

### script to use for binary table format for irstlm or kenlm
# (default: no binarization)

# irstlm
#lm-binarizer = $irstlm-dir/compile-lm

# kenlm, also set type to 8
#lm-binarizer = $moses-bin-dir/build_binary
#type = 8

### script to create quantized language model format (irstlm)
# (default: no quantization)
#
#lm-quantizer = $irstlm-dir/quantize-lm

### script to use for converting into randomized table format
# (default: no randomization)
#
#lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"

### each language model to be used has its own section here

[LM:europarl] IGNORE

### command to run to get raw corpus files
#
#get-corpus-script = ""

### raw corpus (untokenized)
#
raw-corpus = $wmt12-data/training/training.clean.$output-extension

### tokenized corpus files (may contain long sentences)
#
#tokenized-corpus =

### if corpus preparation should be skipped,
# point to the prepared language model
#
#lm =

[LM:nc]
raw-corpus = $wmt12-data/training/training.clean10.$output-extension

[LM:un] IGNORE
raw-corpus = $wmt12-data/training/undoc.2000.$pair-extension.$output-extension

[LM:news] IGNORE
raw-corpus = $wmt12-data/training/news.$output-extension.shuffled

[LM:nc=stem]
factors = "stem"
order = 3
settings = ""
raw-corpus = $wmt12-data/training/training.clean.$output-extension

#################################################################
# INTERPOLATING LANGUAGE MODELS

[INTERPOLATED-LM] IGNORE

# if multiple language models are used, these may be combined
# by optimizing perplexity on a tuning set
# see, for instance [Koehn and Schwenk, IJCNLP 2008]

### script to interpolate language models
# if commented out, no interpolation is performed
#
script = $moses-script-dir/ems/support/interpolate-lm.perl

### tuning set
# you may use the same set that is used for mert tuning (reference set)
#
tuning-sgm = $wmt12-data/dev/newstest2010-ref.$output-extension.sgm
#raw-tuning =
#tokenized-tuning =
#factored-tuning =
#lowercased-tuning =
#split-tuning =

### group language models for hierarchical interpolation
# (flat interpolation is limited to 10 language models)
#group = "first,second fourth,fifth"

### script to use for binary table format for irstlm or kenlm
# (default: no binarization)

# irstlm
#lm-binarizer = $irstlm-dir/compile-lm

# kenlm, also set type to 8
#lm-binarizer = $moses-bin-dir/build_binary
#type = 8

### script to create quantized language model format (irstlm)
# (default: no quantization)
#
#lm-quantizer = $irstlm-dir/quantize-lm

### script to use for converting into randomized table format
# (default: no randomization)
#
#lm-randomizer = "$randlm-dir/buildlm -falsepos 8 -values 8"

#################################################################
# FACTOR DEFINITION

[INPUT-FACTOR]
# also used for output factors
temp-dir = $working-dir/training/factor

[INPUT-FACTOR:stem]
factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"

### script that generates this factor
#
#mxpost = /home/pkoehn/bin/mxpost
factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"

[OUTPUT-FACTOR:stem]
factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"

### script that generates this factor
#
#mxpost = /home/pkoehn/bin/mxpost
factor-script = "$moses-script-dir/training/wrappers/make-factor-stem.perl 3"

#################################################################
# TRANSLATION MODEL TRAINING

[TRAINING]

### training script to be used: either a legacy script or
# current moses training script (default)
#
script = $moses-script-dir/training/train-model.perl

### general options
# these are options that are passed on to train-model.perl, for instance
# * "-mgiza -mgiza-cpus 8" to use mgiza instead of giza
# * "-sort-buffer-size 8G" to reduce on-disk sorting
#
#training-options = ""

### factored training: specify here which factors are used
# if none specified, single factor training is assumed
# (one translation step, surface to surface)
#
input-factors = word stem
output-factors = word stem
alignment-factors = "stem -> stem"
translation-factors = "word -> word"
reordering-factors = "word -> word"
#generation-factors =
decoding-steps = "t0"

### parallelization of data preparation step
# the two directions of the data preparation can be run in parallel
# comment out if not needed
#
parallel = yes

### pre-computation for giza++
# giza++ has a more efficient data structure that needs to be
# initialized with snt2cooc. if run in parallel, this may reduce
# memory requirements. set here the number of parts
#
#run-giza-in-parts = 5

### symmetrization method to obtain word alignments from giza output
# (commonly used: grow-diag-final-and)
#
alignment-symmetrization-method = grow-diag-final-and

### use of berkeley aligner for word alignment
#
#use-berkeley = true
#alignment-symmetrization-method = berkeley
#berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
#berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
#berkeley-jar = /your/path/to/berkeleyaligner-1.1/berkeleyaligner.jar
#berkeley-java-options = "-server -mx30000m -ea"
#berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
#berkeley-process-options = "-EMWordAligner.numThreads 8"
#berkeley-posterior = 0.5

### if word alignment should be skipped,
# point to word alignment files
#
#word-alignment = $working-dir/model/aligned.1

### create a bilingual concordancer for the model
#
#biconcor = $moses-script-dir/ems/biconcor/biconcor

### lexicalized reordering: specify orientation type
# (default: only distance-based reordering model)
#
lexicalized-reordering = msd-bidirectional-fe

### hierarchical rule set
#
hierarchical-rule-set = true

### settings for rule extraction
#
#extract-settings = ""

### unknown word labels (target syntax only)
# enables use of unknown
# word labels during decoding
# label file is generated during rule extraction
#
#use-unknown-word-labels = true

### if phrase extraction should be skipped,
# point to stem for extract files
#
# extracted-phrases =

### settings for rule scoring
#
score-settings = "--GoodTuring"

### include word alignment in phrase table
#
#include-word-alignment-in-rules = yes

### if phrase table training should be skipped,
# point to phrase translation table
#
# phrase-translation-table =

### if reordering table training should be skipped,
# point to reordering table
#
# reordering-table =

### if training should be skipped,
# point to a configuration file that contains
# pointers to all relevant model files
#
#config-with-reused-weights =

#####################################################
### TUNING: finding good weights for model components

[TUNING]

### instead of tuning with this setting, old weights may be recycled
# specify here an old configuration file with matching weights
#
#weight-config = $working-dir/tuning/moses.filtered.ini.1

### tuning script to be used
#
tuning-script = $moses-script-dir/training/mert-moses.pl
tuning-settings = "-mertdir $moses-bin-dir --filtercmd '$moses-script-dir/training/filter-model-given-input.pl'"

### specify the corpus used for tuning
# it should contain 1000s of sentences
#
#input-sgm =
raw-input = $wmt12-data/tuning/tuning.clean.$input-extension
#tokenized-input =
#factorized-input =
#input =
#
#reference-sgm =
raw-reference = $wmt12-data/tuning/tuning.clean.$output-extension
#tokenized-reference =
#factorized-reference =
#reference =

### size of n-best list used (typically 100)
#
nbest = 100

### ranges for weights for random initialization
# if not specified, the tuning script will use generic ranges
# it is not clear, if this matters
#
# lambda =

### additional flags for the filter script
#
#filter-settings = "-Binarizer CreateOnDiskPt 1 1 5 100 2 -Hierarchical"

### additional flags for the decoder
#
decoder-settings = ""

### if tuning should be skipped, specify this here
# and also point to a configuration file that contains
# pointers to all relevant model files
#
#config =

#########################################################
## RECASER: restore case, this part only trains the model

[RECASING]

#decoder = $moses-bin-dir/moses

### training data
# raw input needs to be still tokenized,
# also tokenized input may be specified
#
#tokenized = [LM:europarl:tokenized-corpus]

# recase-config =

#lm-training = $srilm-dir/ngram-count

#######################################################
## TRUECASER: train model to truecase corpora and input

[TRUECASER]

### script to train truecaser models
#
trainer = $moses-script-dir/recaser/train-truecaser.perl

### training data
# data on which truecaser is trained
# if no training data is specified, parallel corpus is used
#
# raw-stem =
# tokenized-stem =

### trained model
#
# truecase-model =

######################################################################
## EVALUATION: translating a test set using the tuned system and score it

[EVALUATION]

### number of jobs (if parallel execution on cluster)
#
#jobs = 10

### additional flags for the filter script
#
#filter-settings = ""

### additional decoder settings
# switches for the Moses decoder
# common choices:
#   "-threads N" for multi-threading
#   "-mbr" for MBR decoding
#   "-drop-unknown" for dropping unknown source words
#   "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000" for cube pruning
#
decoder-settings = "-search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000"

### specify size of n-best list, if produced
#
#nbest = 100

### multiple reference translations
#
#multiref = yes

### prepare system output for scoring
# this may include detokenization and wrapping output in sgm
# (needed for nist-bleu, ter, meteor)
#
detokenizer = "$moses-script-dir/tokenizer/detokenizer.perl -l $output-extension"
#recaser = $moses-script-dir/recaser/recase.perl
wrapping-script = "$moses-script-dir/ems/support/wrap-xml.perl $output-extension"
#output-sgm =

### BLEU
#
nist-bleu = $moses-script-dir/generic/mteval-v13a.pl
nist-bleu-c = "$moses-script-dir/generic/mteval-v13a.pl -c"
#multi-bleu = $moses-script-dir/generic/multi-bleu.perl
#ibm-bleu =

### TER: translation error rate (BBN metric) based on edit distance
# not yet integrated
#
# ter =

### METEOR: gives credit to stem / wordnet synonym matches
# not yet integrated
#
# meteor =

### Analysis: carry out various forms of analysis on the output
#
analysis = $moses-script-dir/ems/support/analysis.perl
#
# also report on input coverage
analyze-coverage = yes
#
# also report on phrase mappings used
report-segmentation = yes
#
# report precision of translations for each input word, broken down by
# count of input word in corpus and model
#report-precision-by-coverage = yes
#
# further precision breakdown by factor
#precision-by-coverage-factor = pos

[EVALUATION:newstest2011]

### input data
#
#input-sgm = "$wmt12-data/$input-extension-test.txt"
#raw-input = $wmt12-data/$input-extension-test.txt
tokenized-input = "$wmt12-data/de-test.txt"
# factorized-input =
#input = $wmt12-data/$input-extension-test.txt

### reference data
#
#reference-sgm = "$wmt12-data/$output-extension-test.txt"
#raw-reference = "$wmt12-data/$output-extension-test.txt"
tokenized-reference = "$wmt12-data/el-test.txt"
#reference = $wmt12-data/el-test.txt

### analysis settings
# may contain any of the general evaluation analysis settings
# specific setting: base coverage statistics on earlier run
#
#precision-by-coverage-base = $working-dir/evaluation/test.analysis.5

### wrapping frame
# for nist-bleu and other scoring scripts, the output needs to be wrapped
# in sgm markup (typically like the input sgm)
#
wrapping-frame = $tokenized-input

##########################################
### REPORTING: summarize evaluation scores

[REPORTING]

### currently no parameters for reporting section

Thank you,
Dimitris Babaniotis
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
