Yes, MGIZA isn't really doing "incremental training": it only initializes the model parameters with those trained previously, since it does not store the sufficient statistics of the previous training. It will give bad performance if
1. You train only model 1 or 2.
2. The incremental data or subset is really small.

It is more suitable for the following scenario: you train a model on corpus A, you have new data B, and you want to train several iterations of model 4 on A+B.

For the incremental-training GIZA, do you know whether it uses online EM (as in Liang and Klein 2009) or just stores the sufficient statistics of the previous training?

--Q

On Wed, Jun 15, 2011 at 11:07 AM, Miles Osborne <[email protected]> wrote:
> That isn't the expected answer here. I think the OP wants some kind of
> incremental (re)training.
>
> Firstly: it is not really possible to guarantee that performance is not
> degraded when running from subsets up to the full set (compared with just
> running it on the full set).
>
> Secondly, you may wish to investigate a version of GIZA which supports
> incremental retraining. This would allow you to train on a subset and then
> add more and more data, without retraining from scratch at each point. The
> current version has minimal documentation, but right now this is hopefully
> being fixed. If you are feeling brave, look here:
>
> http://code.google.com/p/inc-giza-pp/
>
> Miles
>
> On 15 June 2011 18:50, Kenneth Heafield <[email protected]> wrote:
>> Try using MGIZA: http://geek.kyloo.net/software/doku.php/mgiza:overview
>>
>> On 06/15/11 04:51, Prasanth K wrote:
>> > Hello All,
>> >
>> > I am conducting a series of experiments to build translation systems
>> > using Moses, in which the corpus of the current experiment is a subset of
>> > the corpora used in the previous experiment. I have started with the
>> > Europarl corpora and am likely to repeat this process about 20 times.
>> > Unless I am mistaken, this is going to take me nearly a month, and I am
>> > looking for ways to speed up the whole process.
>> >
>> > Is there any optimal way to run Giza++ on these different subsets of
>> > data without having to run it again and again?
>> > "I do not want to use the alignments obtained from running Giza++ on the
>> > entire Europarl corpora for the other experiments (by selecting the
>> > alignment information from aligned.grow-final-and-diag for the sentences
>> > in the subsets)."
>> >
>> > The order of the experiments does not matter, so the experiments can be
>> > done on the smallest dataset first, followed by supersets of the previous
>> > dataset, provided there is a way to modify the translation probabilities
>> > from Giza++ using just the additional data alone, and this does not
>> > degrade the performance of Giza++ in comparison to when Giza++ is run on
>> > the corpus in stand-alone mode.
>> >
>> > Kindly let me know if there is some way to do this that I am missing.
>> >
>> > - regards,
>> > Prasanth
>> >
>> > --
>> > "Theories have four stages of acceptance. i) this is worthless nonsense;
>> > ii) this is an interesting, but perverse, point of view; iii) this is
>> > true, but quite unimportant; iv) I always said so."
>> >
>> > --- J.B.S. Haldane
>> >
>> > _______________________________________________
>> > Moses-support mailing list
>> > [email protected]
>> > http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
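[Editorial note: the point at the top of the thread — that re-initializing parameters from a previous run is not the same as batch training, because the expected counts ("sufficient statistics") from corpus A are discarded — can be illustrated with a toy IBM Model 1 EM loop. This is a minimal sketch, not MGIZA's or inc-giza-pp's actual code; the function name `model1_em` and the tiny corpora are made up for illustration.]

```python
# Toy IBM Model 1 EM. Initializing from a previous table (what MGIZA does)
# and running EM on only the new data B is NOT equivalent to batch EM on
# A+B: the expected counts gathered on A are gone, so the result only
# reflects (and only covers) the new data.
from collections import defaultdict

def model1_em(corpus, t=None, iterations=5):
    """EM for IBM Model 1 on (source_words, target_words) pairs.
    `t` is an optional starting translation table t[f][e]."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    e_vocab = {e for _, es in corpus for e in es}
    # copy the starting table so the caller's table is not mutated
    t = {f: dict(probs) for f, probs in (t or {}).items()}
    for f in f_vocab:                     # uniform init for unseen words
        if f not in t:
            t[f] = {e: 1.0 / len(e_vocab) for e in e_vocab}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))  # expected counts
        total = defaultdict(float)
        for fs, es in corpus:             # E step
            for f in fs:
                z = sum(t[f].get(e, 1e-12) for e in es)
                for e in es:
                    c = t[f].get(e, 1e-12) / z
                    count[f][e] += c
                    total[e] += c
        # M step: renormalize the expected counts
        t = {f: {e: count[f][e] / total[e] for e in count[f]} for f in count}
    return t

corpus_a = [(["das", "haus"], ["the", "house"]),
            (["das", "buch"], ["the", "book"])]
corpus_b = [(["ein", "buch"], ["a", "book"])]

t_a = model1_em(corpus_a)                 # train on A
t_incr = model1_em(corpus_b, t=t_a)       # "incremental": init from A, EM on B only
t_batch = model1_em(corpus_a + corpus_b)  # batch: EM on A+B

# The incremental table has forgotten A: it only covers words seen in B.
print(sorted(t_incr))   # ['buch', 'ein']
print(sorted(t_batch))  # ['buch', 'das', 'ein', 'haus']
```

True incremental training would have to keep `count`/`total` from the A run and fold B's expected counts into them, which is exactly the "sufficient statistics" bookkeeping the top reply says MGIZA does not do.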
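[Editorial note: Q's question above mentions stepwise/online EM in the sense of Liang and Klein (2009). For readers unfamiliar with it, here is a hedged toy sketch of the stepwise update on a two-component binomial ("biased coins") mixture, chosen only because it keeps the sufficient statistics small; it is not a claim about what inc-giza-pp implements, and the function name `stepwise_em_coins` and the synthetic data are invented for illustration.]

```python
# Stepwise (online) EM: after every observation, interpolate the running
# sufficient statistics toward that observation's expected statistics
# with a decaying stepsize eta_k = (k + 2) ** -alpha, instead of
# recomputing the statistics over the whole data set each iteration.
import random

def stepwise_em_coins(data, n_flips, alpha=0.7, passes=3, seed=0):
    """Two-component binomial mixture; `data` holds head counts per draw."""
    rng = random.Random(seed)
    # per-component running stats: [sum of resp, sum of resp * heads],
    # initialized to mildly separated coins so the labels stay anchored
    s = [[0.5, 0.5 * n_flips * 0.3], [0.5, 0.5 * n_flips * 0.7]]
    k = 0
    for _ in range(passes):
        order = data[:]
        rng.shuffle(order)
        for heads in order:
            pi = [s[0][0], s[1][0]]                      # mixing weights
            theta = [s[j][1] / (s[j][0] * n_flips) for j in (0, 1)]
            # E step on this single observation
            lik = [pi[j] * theta[j] ** heads
                   * (1 - theta[j]) ** (n_flips - heads) for j in (0, 1)]
            z = sum(lik)
            resp = [l / z for l in lik]
            # stepwise interpolation of the sufficient statistics
            eta = (k + 2) ** -alpha
            for j in (0, 1):
                s[j][0] = (1 - eta) * s[j][0] + eta * resp[j]
                s[j][1] = (1 - eta) * s[j][1] + eta * resp[j] * heads
            k += 1
    return [s[j][1] / (s[j][0] * n_flips) for j in (0, 1)]

# synthetic data: 50 draws each from coins with P(heads) = 0.2 and 0.8
gen = random.Random(1)
data = [sum(gen.random() < p for _ in range(10))
        for p in [0.2] * 50 + [0.8] * 50]
est = stepwise_em_coins(data, n_flips=10)
print(est)  # the two estimates should land near 0.2 and 0.8
```

The contrast with the other option Q mentions (storing batch sufficient statistics) is that stepwise EM never revisits old data exactly: the old statistics decay geometrically under the stepsize schedule, which is what makes it suitable for adding data incrementally.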
