Re: [Moses-support] Running Giza++ on subsets of data

Miles Osborne Wed, 15 Jun 2011 11:09:54 -0700

that isn't the expected answer here. i think the OP wants some kind of
incremental (re)training.


firstly: there is no real guarantee that performance is not degraded when
training incrementally from subsets up to the full set (compared with just
training once on the full set).

secondly, you may wish to investigate a version of Giza which supports
incremental retraining. this would allow you to train on a subset and then
add more and more data, without retraining from scratch at each point. the
current version has minimal documentation, though this is hopefully being
fixed. if you are feeling brave, look here:

http://code.google.com/p/inc-giza-pp/
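
As a rough illustration of what incremental retraining means (a toy sketch only, not the algorithm implemented in inc-giza-pp or Giza++): the Python below runs IBM Model 1 EM on a first batch, keeps the expected counts, and later runs EM on a new batch alone, adding the stored counts from the old data instead of revisiting it. The function names, the `FLOOR` constant and the count-freezing scheme are all assumptions made for this example.

```python
from collections import defaultdict

FLOOR = 1e-3  # assumed score for word pairs not seen so far (arbitrary value)


def prob(t, f, e):
    """Translation score t(f|e); unseen pairs fall back to FLOOR."""
    return t.get((f, e), FLOOR)


def expected_counts(bitext, t):
    """E-step of IBM Model 1: collect fractional co-occurrence counts."""
    counts = defaultdict(float)
    for src, tgt in bitext:
        for f in tgt:
            z = sum(prob(t, f, e) for e in src)
            for e in src:
                counts[(f, e)] += prob(t, f, e) / z
    return counts


def normalise(counts):
    """M-step: normalise counts over f for each e to get t(f|e)."""
    totals = defaultdict(float)
    for (f, e), c in counts.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}


def train(bitext, old_counts=None, iterations=5):
    """Run EM on `bitext` only; counts from earlier data stay frozen."""
    old_counts = dict(old_counts or {})
    t = normalise(old_counts)      # warm start from what was learnt before
    merged = dict(old_counts)
    for _ in range(iterations):
        new_counts = expected_counts(bitext, t)
        merged = defaultdict(float, old_counts)
        for pair, c in new_counts.items():
            merged[pair] += c
        t = normalise(merged)
    return t, dict(merged)


# Hypothetical toy data: (source tokens, target tokens) per sentence pair.
batch1 = [(["the", "house"], ["das", "haus"]),
          (["the", "book"], ["das", "buch"])]
batch2 = [(["a", "book"], ["ein", "buch"])]

t1, c1 = train(batch1)                   # initial training on the subset
t2, c2 = train(batch2, old_counts=c1)    # add data without revisiting batch1
```

Because the counts from the earlier data are frozen rather than re-estimated, the resulting table will generally differ from one trained on the full set in one go, which is the caveat in the first point above.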

Miles


On 15 June 2011 18:50, Kenneth Heafield <[email protected]> wrote:

> Try using MGIZA: http://geek.kyloo.net/software/doku.php/mgiza:overview
>
> On 06/15/11 04:51, Prasanth K wrote:
> > Hello All,
> >
> > I am conducting a series of experiments to build translation systems
> > using Moses in which the corpus of the current experiment is a subset of
> > the corpora used in the previous experiment. I have started with the
> > Europarl corpora and am likely to repeat this process about 20 times.
> > Unless I am mistaken, this is going to take me nearly a month and I am
> > looking for ways to speed up the whole process.
> >
> > Is there any optimal way to run Giza++ on these different subsets of
> > data without having to run it again and again?
> > "I do not want to use the alignments obtained from running Giza++ on the
> > entire Europarl corpora, for the other experiments (by selecting the
> > alignment information from aligned.grow-final-and-diag for the sentences
> > in the subsets)."
> >
> > The order of the experiments does not matter, so the experiments can be
> > done on the smallest dataset followed by supersets of the previous
> > dataset, provided there is a way to modify the translation probabilities
> > from Giza++ using just the additional data alone and this does not
> > affect the performance of Giza++ in comparison to when Giza++ is run on
> > the corpus in stand-alone mode.
> >
> > Kindly let me know if there is some way to do this and I am missing it.
> >
> > - regards,
> > Prasanth
> >
> >
> > --
> > "Theories have four stages of acceptance. i) this is worthless nonsense;
> > ii) this is an interesting, but perverse, point of view, iii) this is
> > true, but quite unimportant; iv) I always said so."
> >
> >   --- J.B.S. Haldane
> >
> >
> >
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support
>
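
For the setup described in the original question (a series of corpora, each a superset of the previous one), one possible way to build the nested subsets and the "additional data alone" for each step is sketched below. It assumes a line-aligned parallel corpus where the same line indices are applied to both sides; `nested_subsets` and `increments` are hypothetical helper names, not part of Moses or Giza++.

```python
import random


def nested_subsets(n_lines, sizes, seed=0):
    """Return index lists; each list is a superset of the one before it."""
    order = list(range(n_lines))
    random.Random(seed).shuffle(order)   # fixed seed keeps runs reproducible
    return [order[:k] for k in sorted(sizes)]


def increments(subsets):
    """Yield only the line indices that are new at each step."""
    seen = 0
    for subset in subsets:
        yield subset[seen:]
        seen = len(subset)


# Example: a 10-line corpus split into nested subsets of 2, 5 and 10 lines.
subsets = nested_subsets(10, [2, 5, 10])
deltas = list(increments(subsets))
```

Each entry of `deltas` is the data an incremental trainer would need to see at that step; the smallest subset is trained on first and the later entries are added in turn.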



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

