Hi Tom -- Did you run out of RAM? That is the typical cause of failure.
snt2cooc simply calculates the list of co-occurring vocabulary items. That
list is used when building the ttable and count data structures: with the
binary search build option, those two tables are sized exactly, and their
dimensions are determined from the .cooc files.

Attached is a simple Perl script that does the same thing in about 1 GB of
RAM; it works by writing to disk and using sort. It would probably also be
easy to make the C++ code more memory efficient. I don't think Franz made
any effort there (but I haven't looked at the code).

Cheers,
Alex

On Sat, Sep 3, 2011 at 7:42 AM, Tom Hoar <[email protected]> wrote:
> The command lines for MGIZA++'s snt2cooc utility and GIZA++'s snt2cooc.out
> utility are different.
>
> GIZA++ command line:
> Usage: snt2cooc.out vcb1 vcb2 snt12 > output
>
> MGIZA++ command line:
> Usage: snt2cooc output vcb1 vcb2 snt12
>
> To use MGIZA++ without installing GIZA++ and without editing
> train-model.perl, I created this snt2cooc.out Bash script wrapper. It must
> be in the same folder as snt2cooc. It translates the GIZA++ command line
> to MGIZA++ syntax:
>
> #! /bin/bash
> set -e
> usage() {
>     echo "Usage: snt2cooc.out vcb1 vcb2 snt12"
>     echo "Converts GIZA++ snt-format into plain text."
>     exit 1
> }
> [ $# -ne 3 ] && usage
> ${0%/*}/snt2cooc /dev/stdout $1 $2 $3
> exit 0
>
> It has worked flawlessly with a variety of corpora up to 2 million pairs.
> Now, train-model.perl failed with the following Segmentation fault error
> when processing an 11.9 million pair corpus:
>
> (2.1a) running snt2cooc en-pt @ Thu Sep 1 22:47:48 CEST 2011
> Executing: mkdir -p
>   /opt/domy/TRAININGS/alignments/align-12_M-en-pt/giza.en-pt/part1
> Executing: /usr/local/bin/snt2cooc.out
>   /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/pt.vcb
>   /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/en.vcb
>   /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/part1/en-pt-int-train.snt >
>   /opt/domy/TRAININGS/alignments/align-12_M-en-pt/giza.en-pt/part1/en-pt.cooc
> /usr/local/bin/snt2cooc.out
>   /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/pt.vcb
>   /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/en.vcb
>   /opt/domy/TRAININGS/alignments/align-12_M-en-pt/classes/part1/en-pt-int-train.snt >
>   /opt/domy/TRAININGS/alignments/align-12_M-en-pt/giza.en-pt/part1/en-pt.cooc
> line 1000
> line 2000
> line 3000
> ...
> line 5506000
> line 5507000
> /usr/local/bin/snt2cooc.out: line 21: 31858 Segmentation fault
>   ${0%/*}/snt2cooc /dev/stdout $1 $2 $3
> Exit code: 139
> ERROR at /usr/local/bin/train-model.perl line 1031.
>
> Is it possible that a stream of 5.5 million lines could cause a buffer
> overflow or segfault with the snt2cooc.out Bash script above?
>
> Alternatively, could a violation of the 9:1 ratio or a 100-token phrase
> cause this problem?
>
> Could it be a problematic character? I've cleaned the vertical bar and all
> the non-printing control characters, and removed multiple whitespace.
> Could there be others?
>
> Thanks,
> Tom
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
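[Editor's note: the attached Perl script is not reproduced in the archive, but the sort-on-disk idea Alex describes can be sketched in shell. This is a hypothetical illustration, not the attached align_generate_cooc.pl. It reads a GIZA++ .snt file, where each sentence pair occupies three lines (a count, the source word IDs, the target word IDs), and emits the unique co-occurring ID pairs. sort(1) spills temporary runs to disk, so memory stays bounded instead of holding a hash of every pair in RAM; details of the real .cooc output are glossed over.]

```shell
# Hypothetical sketch of the low-memory, sort-based co-occurrence
# extraction (NOT the attached align_generate_cooc.pl).
cooc_pairs() {
    # $1: .snt file -- triples of (count, source IDs, target IDs)
    awk '
        NR % 3 == 2 { n = split($0, src, " "); next }  # source sentence
        NR % 3 == 0 { for (i = 1; i <= n; i++)         # target sentence:
                          for (j = 1; j <= NF; j++)    # emit every pair
                              print src[i], $j }
    ' "$1" |
    sort -n -k1,1 -k2,2 -u   # sort spills to disk; -u removes duplicates
}
```

Usage would be along the lines of `cooc_pairs part1/en-pt-int-train.snt > en-pt.cooc`; the trade-off is extra I/O in exchange for a memory footprint that does not grow with corpus size.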
[Attachment: align_generate_cooc.pl (binary data)]
