Lewis — This is a good-sized dataset, and on a single desktop machine, I expect 
it would take at least a day to go all the way through alignment, 
model-building, and tuning.

fast_align is a good idea, though it isn't integrated into the pipeline 
(shouldn't be too hard, and is on the list). You could also just try "--aligner 
berkeley" and see if that works. 

Do you see anything in the GIZA error logs (RUNDIR/alignment/0/...)? Sometimes 
GIZA doesn't compile correctly, and this could be an error where it doesn't 
find GIZA++ or one of the support binaries (mkcls, snt2cooc.out).
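
A quick way to rule that out is a check along these lines. This is only a sketch, not part of Joshua: check_giza_binaries and GIZA_BIN_DIR are hypothetical names, and the directory that actually holds the binaries depends on how your copy was built, so set it yourself.

```shell
# Hypothetical helper (not part of Joshua): report whether GIZA++ and its
# support binaries exist and are executable in the given directory.
check_giza_binaries() {
  local dir="$1"
  local missing=0
  local name
  for name in GIZA++ mkcls snt2cooc.out; do
    if [ -x "$dir/$name" ]; then
      echo "ok       $dir/$name"
    else
      echo "MISSING  $dir/$name"
      missing=1
    fi
  done
  # Non-zero return status means at least one binary was not found.
  return "$missing"
}

# GIZA_BIN_DIR is an assumed variable; nothing runs unless you set it.
if [ -n "${GIZA_BIN_DIR:-}" ]; then
  check_giza_binaries "$GIZA_BIN_DIR" \
    || echo "At least one GIZA binary is missing or not executable."
fi
```

If any line comes back MISSING, rebuilding GIZA is probably the first thing to try before rerunning the pipeline.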

matt


> On Jul 16, 2016, at 6:01 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> 
> wrote:
> 
> Hi Folks,
> When attempting to build a Hiero model using 5K sentences for tuning, many
> many more than that for testing and again many many more than that for the
> actual corpus (~880K) I get the following error within the GIZA alignment
> pipeline phase.
> 
> Anyone have a clue what this means? I have the full GIZA logs if they are
> useful.
> I did find a thread on a VERY similar issue at [0]. The solution seems to
> be to use absolute paths to all input data for the pipeline however that is
> exactly what I've done e.g.
> 
> $JOSHUA/bin/pipeline.pl  --rundir . --type hiero --corpus
> /usr/local/joshua_input/commoncrawl.ru-en --tune
> /usr/local/joshua_input/commoncrawl.ru-en.tune --test
> /usr/local/joshua_input/commoncrawl.ru-en.test --source en --target ru
> --rundir experiment1/1 --readme "Experiment 1 Run 1 Hiero Russian to
> English Translation model" --mbr
> 
> Where the parallel .en and .ru sentence files exist for all of the above
> corpus, tune and test paths respectively.
> 
> [0] http://comments.gmane.org/gmane.comp.nlp.moses.user/10489
> 
> I have been having trouble consistently when generating models using
> GIZA... is there a suggested alignment substitute which I should be trying
> out?
> 
> One last question... roughly how long should a Hiero-based LM for a corpus
> of ~880K sentences take on say a MacBook Pro 2.7GHz Intel Core i7 16GB
> mem. I remember reading a while ago on the old Joshua site that a pipeline
> would run in 10 or so minutes... this is clearly not the case and I would
> like to share/compare some results if possible with others who are in the
> business of generating LM and language packs.
> 
> Thanks
> 
> ==========================================================
> Executing: bash -c rm -f alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz
> Executing: bash -c gzip alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final
> Waiting for second GIZA process...
> (3) generate word alignment @ Fri Jul 15 16:38:42 PDT 2016
> Combining forward and inverted alignment from files:
>  alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.{bz2,gz}
>  alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.{bz2,gz}
> Executing: bash -c mkdir -p alignments/0/model
> Executing: bash -c /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d
> <(gzip -cd alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
> |/usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no"
> -o=alignments/0/model/aligned.grow-diag-final
> symal: computing grow alignment: diagonal (1) final (1)both-uncovered (0)
> skip=<0> counts=<817962>
> symal(9081,0x7fff76241310) malloc: *** error for object 0x7fff74472250:
> pointer being freed was not allocated
> *** set a breakpoint in malloc_error_break to debug
> bash: line 1:  9080 Done
> /usr/local/incubator-joshua/ext/symal/giza2bal.pl -d <(gzip -cd
> alignments/0/giza.ru.0-en.0/ru.0-en.0.A3.final.gz) -i <(gzip -cd
> alignments/0/giza.en.0-ru.0/en.0-ru.0.A3.final.gz)
>      9081 Abort trap: 6           |
> /usr/local/incubator-joshua/ext/symal/symal -alignment="grow"
> -diagonal="yes" -final="yes" -both="no"
> -o=alignments/0/model/aligned.grow-diag-final
> Exit code: 134
> ERROR: Can't generate symmetrized alignment file
> 
> 
> 
> -- 
> *Lewis*
