The command lines for MGIZA++'s snt2cooc utility and GIZA++'s
snt2cooc.out utility are different.
GIZA++ command line:
Usage:
snt2cooc.out vcb1 vcb2 snt12 > output
MGIZA++ command line:
Usage:
snt2cooc output vcb1 vcb2 snt12
To use MGIZA++ without installing GIZA++ and without editing
train-model.perl, I created this snt2cooc.out Bash wrapper script. It
must be in the same folder as snt2cooc. It translates the GIZA++
command line to MGIZA++ syntax:
#! /bin/bash
set -e

usage() {
    echo "Usage: snt2cooc.out vcb1 vcb2 snt12"
    echo "Converts GIZA++ snt-format into plain text."
    exit 1
}

[ $# -ne 3 ] && usage

${0%/*}/snt2cooc /dev/stdout $1 $2 $3
exit 0
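A slightly more defensive variant of the same translation might look like the sketch below. It quotes the arguments so that paths containing spaces survive, and it sends the usage text to stderr so that it cannot end up in the redirected .cooc output. The function name snt2cooc_compat and the SNT2COOC_BIN override are made up for illustration; neither is part of MGIZA++ or Moses.

```shell
# Sketch only: translate the GIZA++ call (vcb1 vcb2 snt12 > output)
# into the MGIZA++ call (output vcb1 vcb2 snt12).
# SNT2COOC_BIN is a hypothetical override for the path of the real binary;
# by default the binary is expected next to the wrapper script itself.
snt2cooc_compat() {
    if [ "$#" -ne 3 ]; then
        # Usage goes to stderr, not into the redirected output file.
        echo "Usage: snt2cooc.out vcb1 vcb2 snt12" >&2
        return 1
    fi
    # MGIZA++ takes the output file first; /dev/stdout keeps the
    # caller's "> output" redirection working.
    "${SNT2COOC_BIN:-${0%/*}/snt2cooc}" /dev/stdout "$1" "$2" "$3"
}
```

In the wrapper file itself, the last line would then be a call such as snt2cooc_compat "$@".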
It has worked flawlessly with a variety of corpora of up to 2 million
pairs. Now, however, train-model.perl has failed with the following
segmentation fault while processing an 11.9 million pair corpus:
(2.1a) running snt2cooc en-pt @ Thu Sep 1 22:47:48 CEST 2011
Executing: mkdir -p /opt/domy/TRAININGS/alignments/align-12_M-en-pt/giza.en-pt/part1
Executing: /usr/local/bin/snt2cooc.out /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/pt.vcb /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/en.vcb /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/part1/en-pt-int-train.snt > /opt/domy/TRAININGS/alignments/align-12_M-en-pt/giza.en-pt/part1/en-pt.cooc
/usr/local/bin/snt2cooc.out /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/pt.vcb /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/en.vcb /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/part1/en-pt-int-train.snt > /opt/domy/TRAININGS/alignments/align-12_M-en-pt/giza.en-pt/part1/en-pt.cooc
line 1000
line 2000
line 3000
...
line 5506000
line 5507000
/usr/local/bin/snt2cooc.out: line 21: 31858 Segmentation fault ${0%/*}/snt2cooc /dev/stdout $1 $2 $3
Exit code: 139
ERROR at /usr/local/bin/train-model.perl line 1031.
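As far as I understand it, an exit code above 128 is the shell's way of reporting that a child process was killed by a signal: the code is 128 plus the signal number, so 139 would correspond to signal 11 (SIGSEGV). A minimal sketch for decoding such a code follows; signal_from_status is just a name chosen for illustration.

```shell
# Sketch: map an exit status above 128 to the name of the signal
# that terminated the child process; print plain exit codes as-is.
signal_from_status() {
    code=$1
    if [ "$code" -gt 128 ]; then
        # kill -l <number> prints the signal name without the SIG prefix.
        kill -l "$((code - 128))"
    else
        echo "exit $code"
    fi
}
```

Calling signal_from_status 139 should print the signal name for the failure above.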
Is it possible that a stream of 5.5 million lines could cause a buffer
overflow or segfault in the snt2cooc.out Bash script above?

Alternatively, could a sentence pair that violates the 9:1 length ratio
or the 100-token limit cause this problem?

Could it be a problematic character? I've removed the vertical bar and
all non-printing control characters, and collapsed multiple whitespace.
Could there be others?
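To rule out the length limits, something like the following could list suspect pairs before training. It is only a sketch. It assumes the two sides of the corpus are available as line-aligned plain-text files that contain no tab characters. The name find_suspect_pairs and its arguments are hypothetical, and the defaults of 100 tokens and 9:1 are simply the limits mentioned above, not values read from the Moses scripts.

```shell
# Sketch: print the line numbers of sentence pairs that have an empty side,
# exceed max_len tokens on either side, or exceed a max_ratio:1 length ratio.
# Usage: find_suspect_pairs SRC_FILE TGT_FILE [MAX_LEN] [MAX_RATIO]
find_suspect_pairs() {
    src=$1
    tgt=$2
    max_len=${3:-100}
    max_ratio=${4:-9}
    # paste joins line N of both files with a tab between them.
    paste "$src" "$tgt" | awk -F '\t' -v max_len="$max_len" -v max_ratio="$max_ratio" '
        {
            # split on " " counts whitespace-separated tokens.
            s = split($1, a, " ")
            t = split($2, b, " ")
            if (s == 0 || t == 0)
                print NR ": empty side"
            else if (s > max_len || t > max_len)
                print NR ": too long (" s "/" t ")"
            else if (s > max_ratio * t || t > max_ratio * s)
                print NR ": ratio (" s "/" t ")"
        }'
}
```

An empty result would suggest that neither limit is the cause, which leaves the character set and the corpus size as the remaining suspects.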
Thanks,
Tom