Hi all,
I have installed the incremental GIZA software, but I am running into some
problems when I run it.
I have a model trained on 1M sentences and I intend to update it
incrementally with a batch of 4,000 sentences. However, when I run GIZA++
it produces a strange memory error:
*** glibc detected *** GIZA++-v2/GIZA++: malloc(): smallbin double
linked list corrupted: 0x0000000012fd4e90 ***
The error occurs while the old alignment file is being loaded in the
readJumps function in HMMTables.cpp.
The format of the alignment file that I give as input (*.a2.*) is
1 2 1 100 0.999946
0 3 1 100 0.000132484
1 3 1 100 0.999867
1 4 1 100 1
that is, <src-pos> <trg-pos> <l-src> <l-trg> <prob>,
but the readJumps function reads each line in a different format:
<sentence-length> <jump> <prob> <jump> <prob> ...
so either I am giving the wrong input or the parsing is different from what I expect.
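To illustrate what I think is happening, here is a minimal sketch (an assumed field interpretation, not the actual HMMTables.cpp code) of how a reader that expects <sentence-length> followed by <jump> <prob> pairs would misread the first line of my *.a2.* file:

// Minimal sketch, NOT the real readJumps: it only shows how a
// "<sentence-length> <jump> <prob> ..." style reader would interpret a
// line whose real layout is "<src-pos> <trg-pos> <l-src> <l-trg> <prob>".
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::string line = "1 2 1 100 0.999946";  // first line of my *.a2.* file
    std::istringstream in(line);

    int sentLength;
    in >> sentLength;        // reads 1 (actually my <src-pos>)

    int jump;
    double prob;
    while (in >> jump >> prob) {
        // 1st pair: jump=2,   prob=1          (actually <trg-pos> and <l-src>)
        // 2nd pair: jump=100, prob=0.999946   (actually <l-trg> and the real prob)
        std::cout << "sentLength " << sentLength << "  "
                  << jump << " : " << prob << "\n";
    }
    return 0;
}

If the first field (my source position) is taken as a sentence length and the remaining fields are paired up as jump/probability values, that would at least be consistent with the odd "100 : ..." entries further down in the log, and might also be what corrupts memory.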
Could you help me here? Are there any constraints that I should be aware of?
My GIZA config file for the incremental training and its log file are
attached.
Another question: why is the target sentence length always 100?
Thanks,
Prashant Mathur
<!some unnecessary lines were deleted from the log file due to its size!>
S: /project/test-for-alignments/splits/1/1.en.vcb
T: /project/test-for-alignments/splits/1/1.it.vcb
C: /project/test-for-alignments/splits/1/1.en_1.it.snt
O: output/stepWise.hmm
coocurrencefile: /project/test-for-alignments/splits/1/1.it-en.cooc
model1iterations: 5
model1dumpfrequency: 5
hmmiterations: 1
hmmdumpfrequency: 1
model2iterations: 0
model3iterations: 0
model4iterations: 0
model5iterations: 0
emAlignmentDependencies: 1
step_k: 1
oldTrPrbs:
/project/training-stage/trained-data/IT-Domain-scrambled-base/en-it/lc/giza.en-it/en-it.t1.5
oldAlPrbs:
/project/training-stage/trained-data/IT-Domain-scrambled-base/en-it/lc/giza.en-it/en-it.a2.5
The following options are from the config file and will be overwritten by any command line options.
Parameter 's' changed from '' to '/project/test-for-alignments/splits/1/1.en.vcb'
Parameter 't' changed from '' to '/project/test-for-alignments/splits/1/1.it.vcb'
Parameter 'c' changed from '' to '/project/test-for-alignments/splits/1/1.en_1.it.snt'
Parameter 'o' changed from '112-06-07.144621.prashant' to 'output/stepWise.hmm'
Parameter 'coocurrencefile' changed from '' to '/project/test-for-alignments/splits/1/1.it-en.cooc'
Parameter 'model1dumpfrequency' changed from '0' to '5'
Parameter 'hmmiterations' changed from '5' to '1'
Parameter 'hmmdumpfrequency' changed from '0' to '1'
Parameter 'model3iterations' changed from '5' to '0'
Parameter 'model4iterations' changed from '5' to '0'
Parameter 'stepk' changed from '0' to '1'
Parameter 'oldtrprbs' changed from '' to '/project/training-stage/trained-data/IT-Domain-scrambled-base/en-it/lc/giza.en-it/en-it.t1.5'
Parameter 'oldalprbs' changed from '' to '/project/training-stage/trained-data/IT-Domain-scrambled-base/en-it/lc/giza.en-it/en-it.a2.5'
general parameters:
-------------------
ml = 101 (maximum sentence length)
No. of iterations:
-------------------
hmmiterations = 1 (mh)
model1iterations = 5 (number of iterations for Model 1)
model2iterations = 0 (number of iterations for Model 2)
model3iterations = 0 (number of iterations for Model 3)
model4iterations = 0 (number of iterations for Model 4)
model5iterations = 0 (number of iterations for Model 5)
model6iterations = 0 (number of iterations for Model 6)
parameter for various heuristics in GIZA++ for efficient training:
------------------------------------------------------------------
countincreasecutoff = 1e-06 (Counts increment cutoff threshold)
countincreasecutoffal = 1e-05 (Counts increment cutoff threshold for alignments in training of fertility models)
mincountincrease = 1e-07 (minimal count increase)
peggedcutoff = 0.03 (relative cutoff probability for alignment-centers in pegging)
probcutoff = 1e-07 (Probability cutoff threshold for lexicon probabilities)
probsmooth = 1e-07 (probability smoothing (floor) value )
rpcport = 8090 (port to run the XMLRPC server on)
skipunfound = 1 (Flag to skip missing cooc entries)
stepalpha = 0.9 (stepsize)
stepk = 1 (Number of ONLINE UPDATES made so far)
parameters for describing the type and amount of output:
-----------------------------------------------------------
compactalignmentformat = 0 (0: detailled alignment format, 1: compact alignment format )
hmmdumpfrequency = 1 (dump frequency of HMM)
l = 112-06-07.144621.prashant.log (log file name)
log = 0 (0: no logfile; 1: logfile)
model1dumpfrequency = 5 (dump frequency of Model 1)
model2dumpfrequency = 0 (dump frequency of Model 2)
model345dumpfrequency = 0 (dump frequency of Model 3/4/5)
nbestalignments = 0 (for printing the n best alignments)
nodumps = 0 (1: do not write any files)
o = output/stepWise.hmm (output file prefix)
onlyaldumps = 0 (1: do not write any files)
outputpath = (output path)
rungizaserver = 0 (1: run GIZA as XMLRPC server)
transferdumpfrequency = 0 (output: dump of transfer from Model 2 to 3)
verbose = 0 (0: not verbose; 1: verbose)
verbosesentence = -10 (number of sentence for which a lot of information should be printed (negative: no output))
parameters describing input files:
----------------------------------
c = /project/test-for-alignments/splits/1/1.en_1.it.snt (training corpus file name)
d = (dictionary file name)
s = /project/test-for-alignments/splits/1/1.en.vcb (source vocabulary file name)
t = /project/test-for-alignments/splits/1/1.it.vcb (target vocabulary file name)
tc = (test corpus file name)
smoothing parameters:
---------------------
emalsmooth = 0.2 (f-b-trn: smoothing factor for HMM alignment model (can be ignored by -emSmoothHMM))
model23smoothfactor = 0 (smoothing parameter for IBM-2/3 (interpolation with constant))
model4smoothfactor = 0.2 (smooting parameter for alignment probabilities in Model 4)
model5smoothfactor = 0.1 (smooting parameter for distortion probabilities in Model 5 (linear interpolation with constant))
nsmooth = 64 (smoothing for fertility parameters (good value: 64): weight for wordlength-dependent fertility parameters)
nsmoothgeneral = 0 (smoothing for fertility parameters (default: 0): weight for word-independent fertility parameters)
parameters modifying the models:
--------------------------------
compactadtable = 1 (1: only 3-dimensional alignment table for IBM-2 and IBM-3)
deficientdistortionforemptyword = 0 (0: IBM-3/IBM-4 as described in (Brown et al. 1993); 1: distortion model of empty word is deficient; 2: distoriton model of empty word is deficient (differently); setting this parameter also helps to avoid that during IBM-3 and IBM-4 training too many words are aligned with the empty word)
depm4 = 76 (d_{=1}: &1:l, &2:m, &4:F, &8:E, d_{>1}&16:l, &32:m, &64:F, &128:E)
depm5 = 68 (d_{=1}: &1:l, &2:m, &4:F, &8:E, d_{>1}&16:l, &32:m, &64:F, &128:E)
emalignmentdependencies = 1 (lextrain: dependencies in the HMM alignment model. &1: sentence length; &2: previous class; &4: previous position; &8: French position; &16: French class)
emprobforempty = 0.4 (f-b-trn: probability for empty word)
parameters modifying the EM-algorithm:
--------------------------------------
m5p0 = -1 (fixed value for parameter p_0 in IBM-5 (if negative then it is determined in training))
manlexfactor1 = 0 ()
manlexfactor2 = 0 ()
manlexmaxmultiplicity = 20 ()
maxfertility = 10 (maximal fertility for fertility models)
p0 = -1 (fixed value for parameter p_0 in IBM-3/4 (if negative then it is determined in training))
pegging = 0 (0: no pegging; 1: do pegging)
reading vocabulary files
Reading vocabulary file from:/project/test-for-alignments/splits/1/1.en.vcb
Reading vocabulary file from:/project/test-for-alignments/splits/1/1.it.vcb
Source vocabulary list has 127317 unique tokens
Target vocabulary list has 145374 unique tokens
Calculating vocabulary frequencies from corpus /project/test-for-alignments/splits/1/1.en_1.it.snt
Reading more sentence pairs into memory ...
Corpus fits in memory, corpus has: 4164 sentence pairs.
Train total # sentence pairs (weighted): 2082
Size of source portion of the training corpus: 11812.5 tokens
Size of the target portion of the training corpus: 12102.5 tokens
In source portion of the training corpus, only 2569 unique tokens appeared
In target portion of the training corpus, only 2842 unique tokens appeared
lambda for PP calculation in IBM-1,IBM-2,HMM:= 12102.5/(13894.5-2082)== 1.02455
Loading coocurrence file...
There are 21472913 21472913 entries in table
Model1: loading t table
Reading T prob. table from /project/training-stage/trained-data/IT-Domain-scrambled-base/en-it/lc/giza.en-it/en-it.t1.5
A:DID NOT FIND ENTRY: 3 0
A:DID NOT FIND ENTRY: 4 0
A:DID NOT FIND ENTRY: 5 0
A:DID NOT FIND ENTRY: 6 0
A:DID NOT FIND ENTRY: 7 0
A:DID NOT FIND ENTRY: 8 0
A:DID NOT FIND ENTRY: 10 0
A:DID NOT FIND ENTRY: 11 0
A:DID NOT FIND ENTRY: 12 0
A:DID NOT FIND ENTRY: 13 0
A:DID NOT FIND ENTRY: 14 0
A:DID NOT FIND ENTRY: 15 0
A:DID NOT FIND ENTRY: 16 0
A:DID NOT FIND ENTRY: 17 0
A:DID NOT FIND ENTRY: 18 0
A:DID NOT FIND ENTRY: 19 0
A:DID NOT FIND ENTRY: 20 0
A:DID NOT FIND ENTRY: 21 0
A:DID NOT FIND ENTRY: 33 0
A:DID NOT FIND ENTRY: 34 0
A:DID NOT FIND ENTRY: 36 0
A:DID NOT FIND ENTRY: 37 0
A:DID NOT FIND ENTRY: 38 0
A:DID NOT FIND ENTRY: 39 0
A:DID NOT FIND ENTRY: 40 0
A:DID NOT FIND ENTRY: 41 0
A:DID NOT FIND ENTRY: 42 0
A:DID NOT FIND ENTRY: 44 1
A:DID NOT FIND ENTRY: 47 0
A:DID NOT FIND ENTRY: 49 1
A:DID NOT FIND ENTRY: 50 0
A:DID NOT FIND ENTRY: 52 0
A:DID NOT FIND ENTRY: 54 0
A:DID NOT FIND ENTRY: 58 0
A:DID NOT FIND ENTRY: 59 0
A:DID NOT FIND ENTRY: 61 0
A:DID NOT FIND ENTRY: 62 1
A:DID NOT FIND ENTRY: 63 0
A:DID NOT FIND ENTRY: 68 0
A:DID NOT FIND ENTRY: 70 0
A:DID NOT FIND ENTRY: 72 0
A:DID NOT FIND ENTRY: 75 1
A:DID NOT FIND ENTRY: 79 0
A:DID NOT FIND ENTRY: 83 1
A:DID NOT FIND ENTRY: 86 0
A:DID NOT FIND ENTRY: 87 1
A:DID NOT FIND ENTRY: 89 0
A:DID NOT FIND ENTRY: 92 1
A:DID NOT FIND ENTRY: 95 1
A:DID NOT FIND ENTRY: 97 1
A:DID NOT FIND ENTRY: 98 0
A:DID NOT FIND ENTRY: 100 0
A:DID NOT FIND ENTRY: 101 1
A:DID NOT FIND ENTRY: 102 1
A:DID NOT FIND ENTRY: 103 1
A:DID NOT FIND ENTRY: 112 0
Read 13011 entries into prob. table. 9532 not found.
==========================================================
Model1 Training Started at: Thu Jun 7 14:46:29 2012
-----------
Model1: Iteration 1
number of French (target) words = 145374
initial unifrom prob = 6.87881e-06
Model1: (1) TRAIN CROSS-ENTROPY 20.9065 PERPLEXITY 1.96553e+06
Model1: (1) VITERBI TRAIN CROSS-ENTROPY inf PERPLEXITY inf
Model 1 Iteration: 1 took: 0 seconds
-----------
Model1: Iteration 2
number of French (target) words = 145374
initial unifrom prob = 6.87881e-06
Model1: (2) TRAIN CROSS-ENTROPY 5.46876 PERPLEXITY 44.2853
Model1: (2) VITERBI TRAIN CROSS-ENTROPY 6.84189 PERPLEXITY 114.713
Model 1 Iteration: 2 took: 0 seconds
-----------
Model1: Iteration 3
number of French (target) words = 145374
initial unifrom prob = 6.87881e-06
Model1: (3) TRAIN CROSS-ENTROPY 5.16462 PERPLEXITY 35.8678
Model1: (3) VITERBI TRAIN CROSS-ENTROPY 6.34596 PERPLEXITY 81.3436
Model 1 Iteration: 3 took: 1 seconds
-----------
Model1: Iteration 4
number of French (target) words = 145374
initial unifrom prob = 6.87881e-06
Model1: (4) TRAIN CROSS-ENTROPY 4.95396 PERPLEXITY 30.9948
Model1: (4) VITERBI TRAIN CROSS-ENTROPY 5.99987 PERPLEXITY 63.9943
Model 1 Iteration: 4 took: 0 seconds
-----------
Model1: Iteration 5
number of French (target) words = 145374
initial unifrom prob = 6.87881e-06
Model1: (5) TRAIN CROSS-ENTROPY 4.80105 PERPLEXITY 27.8779
Model1: (5) VITERBI TRAIN CROSS-ENTROPY 5.75541 PERPLEXITY 54.0197
Model 1 Iteration: 5 took: 29 seconds
Entire Model1 Training took: 30 seconds
Loading HMM alignments from file.
sentLength 0
1 : 0
100 : 0.035315
sentLength 9
20 : 1
100 : 5.5847
sentLength 10
20 : 1
100 : 4.8827
sentLength 11
20 : 1
100 : 9.289
sentLength 0
21 : 1
100 : 2.7217
sentLength 1
21 : 1
100 : 0.00405142
sentLength 2
21 : 1
100 : 0.95798
sentLength 3
21 : 1
100 : 0.37289
sentLength 4
21 : 1
100 : 0.28342
sentLength 5
21 : 1
100 : 0.38787
sentLength 6
21 : 1
100 : 0.00861426
sentLength 7
21 : 1
100 : 0.75645
sentLength 8
21 : 1
100 : 4.3335
sentLength 9
21 : 1
100 : 0.3136
sentLength 10
21 : 1
100 : 0.8046
sentLength 11
21 : 1
100 : 7.0554
sentLength 0
22 : 1
100 : 0.2305
sentLength 1
22 : 1
100 : 0.05117
sentLength 2
22 : 1
100 : 0.21584
sentLength 3
22 : 1
100 : 0.67759
sentLength 4
22 : 1
100 : 0.47402
sentLength 5
22 : 1
100 : 0.72526
sentLength 6
22 : 1
100 : 0.0938
sentLength 7
22 : 1
100 : 0.17712
sentLength 8
22 : 1
100 : 0.256228
sentLength 9
22 : 1
100 : 0.5512
sentLength 10
22 : 1
100 : 0.15876
sentLength 11
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support