Lewis — This is a good-sized dataset, and on a single desktop machine, I expect it would take at least a day to go all the way through alignment, model-building, and tuning.
fast_align is a good idea, though it isn't integrated into the pipeline yet (that shouldn't be too hard, and it is on the list). You could also just try "--aligner berkeley" and see if that works. Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)? Sometimes GIZA doesn't compile correctly, and this could be an error where it doesn't find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).

matt

> On Jul 16, 2016, at 6:01 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> wrote:
>
> Hi Folks,
> When attempting to build a hiero model using 5K sentences for tuning, many
> many more than that for testing and again many many more than that for the
> actual corpus (~880K) I get the following error within the GIZA alignment
> pipeline phase.
>
> Anyone have a clue what this means? I have the full GIZA logs if they are
> useful.
> I did find a thread on a VERY similar issue at [0]. The solution seems to
> be to use absolute paths to all input data for the pipeline; however, that is
> exactly what I've done, e.g.
>
> $JOSHUA/bin/pipeline.pl --rundir . --type hiero --corpus
> /usr/local/joshua_input/commoncrawl.ru-en --tune
> /usr/local/joshua_input/commoncrawl.ru-en.tune --test
> /usr/local/joshua_input/commoncrawl.ru-en.test --source en --target ru
> --rundir experiment1/1 --readme “Experiment 1 Run 1 Hiero Russian to
> English Translation model” --mbr
>
> Where the parallel .en and .ru sentence files exist for all of the above
> corpus, tune and test paths respectively.
>
> [0] http://comments.gmane.org/gmane.comp.nlp.moses.user/10489
>
> I have been having trouble consistently when generating models using
> GIZA... is there a suggested alignment substitute which I should be trying
> out?
>
> One last question... roughly how long should a Hiero-based LM for a corpus
> of ~880K sentences take on, say, a MacBook Pro 2.7GHz Intel Core i7 16GB
> mem? I remember reading a while ago on the old Joshua site that a pipeline
> would run in 10 or so minutes... this is clearly not the case and I would
> like to share/compare some results if possible with others who are in the
> business of generating LM and language packs.
>
> Thanks
>
> ==========================================================
> Executing: bash -c rm -f alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz
> Executing: bash -c gzip alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final
> Waiting for second GIZA process...
> (3) generate word alignment @ Fri Jul 15 16:38:42 PDT 2016
> Combining forward and inverted alignment from files:
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.{bz2,gz}
> alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.{bz2,gz}
> Executing: bash -c mkdir -p alignments/0/model
> Executing: bash -c /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d
> <(gzip -cd alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
> |/usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no"
> -o=alignments/0/model/aligned.grow-diag-final
> symal: computing grow alignment: diagonal (1) final (1)both-uncovered (0)
> skip=<0> counts=<817962>
> symal(9081,0x7fff76241310) malloc: *** error for object 0x7fff74472250:
> pointer being freed was not allocated
> *** set a breakpoint in malloc_error_break to debug
> bash: line 1: 9080 Done
> /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d <(gzip -cd
> alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
> 9081 Abort trap: 6 |
> /usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no"
> -o=alignments/0/model/aligned.grow-diag-final
> Exit code: 134
> ERROR: Can't generate symmetrized alignment file
>
>
>
> --
> *Lewis*
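
For anyone following up on the suggestion in the reply above to check whether GIZA++ and its support binaries (mkcls, snt2cooc.out) can be found, here is a minimal sketch of such a check. The function name check_binaries is made up for illustration, and the directory you pass it is an assumption: the thread does not say where the pipeline looks for these binaries, so point it at wherever your build put them.

```shell
# Hypothetical helper (not part of Joshua): report which of the named
# binaries are present and executable in a given directory.
# Usage: check_binaries DIR NAME...
# Exit status is 0 if every binary was found, non-zero otherwise.
check_binaries() (
  dir="$1"
  shift
  missing=0
  for name in "$@"; do
    # Must be an executable regular file, not a directory.
    if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
      echo "ok:      $dir/$name"
    else
      echo "missing: $dir/$name"
      missing=$((missing + 1))
    fi
  done
  # Last command sets the function's exit status.
  [ "$missing" -eq 0 ]
)
```

It could be called as `check_binaries "$JOSHUA/bin" GIZA++ mkcls snt2cooc.out`, assuming the binaries were built into $JOSHUA/bin; adjust the path if your install differs. A "missing" line would point to the compile problem described above. Note that this only checks that the files are there; it would not catch the symal crash shown in the log.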