Yes, MGIZA isn't really doing "incremental training": it only initializes the model parameters with those trained previously, since it does not store the sufficient statistics of the previous training. It will give bad performance if
1. You train only model 1 or 2.
2. The incremental data or subset is really small.

It is more suitable for the following scenario: you train a model on corpus A, you have new data B, and you want to train several iterations of model 4 on A+B.

For the incremental-training GIZA, do you know whether it uses online EM (as in Liang and Klein 2009) or just stores the sufficient statistics of the previous training?

--Q

On Wed, Jun 15, 2011 at 11:07 AM, Miles Osborne <[email protected]> wrote:
> That isn't the expected answer here. I think the OP wants some kind of
> incremental (re)training.
>
> Firstly: it is not really possible to guarantee that performance is not
> degraded when running from subsets up to the full set (compared with just
> running it on the full set).
>
> Secondly, you may wish to investigate a version of GIZA which supports
> incremental retraining. This would allow you to train on a subset and then
> add more and more data, without retraining from scratch at each point. The
> current version has minimal documentation, but right now this is hopefully
> being fixed. If you are feeling brave, look here:
>
> http://code.google.com/p/inc-giza-pp/
>
> Miles
>
> On 15 June 2011 18:50, Kenneth Heafield <[email protected]> wrote:
>> Try using MGIZA: http://geek.kyloo.net/software/doku.php/mgiza:overview
>>
>> On 06/15/11 04:51, Prasanth K wrote:
>> > Hello All,
>> >
>> > I am conducting a series of experiments to build translation systems
>> > using Moses, in which the corpus of the current experiment is a subset of
>> > the corpora used in the previous experiment. I have started with the
>> > Europarl corpora and am likely to repeat this process about 20 times.
>> > Unless I am mistaken, this is going to take me nearly a month, and I am
>> > looking for ways to speed up the whole process.
>> >
>> > Is there any optimal way to run Giza++ on these different subsets of
>> > data without having to run it again and again?
>> > "I do not want to use the alignments obtained from running Giza++ on the
>> > entire Europarl corpora for the other experiments (by selecting the
>> > alignment information from aligned.grow-final-and-diag for the sentences
>> > in the subsets)."
>> >
>> > The order of the experiments does not matter, so the experiments can be
>> > done on the smallest dataset first, followed by supersets of the previous
>> > dataset, provided there is a way to modify the translation probabilities
>> > from Giza++ using just the additional data alone, and this does not
>> > degrade the performance of Giza++ in comparison to when Giza++ is run on
>> > the corpus in stand-alone mode.
>> >
>> > Kindly let me know if there is some way to do this that I am missing.
>> >
>> > - regards,
>> > Prasanth
>> >
>> > --
>> > "Theories have four stages of acceptance. i) this is worthless nonsense;
>> > ii) this is an interesting, but perverse, point of view; iii) this is
>> > true, but quite unimportant; iv) I always said so."
>> >
>> > --- J.B.S. Haldane
>> >
>> > _______________________________________________
>> > Moses-support mailing list
>> > [email protected]
>> > http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
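[Editorial note: the point at the top of the thread — that re-initializing parameters from a previous run is not the same as batch training, because the expected counts ("sufficient statistics") from corpus A are discarded — can be illustrated with a toy IBM Model 1 EM loop. This is a minimal sketch, not MGIZA's or inc-giza-pp's actual code; the function name `model1_em` and the tiny corpora are made up for illustration.]

```python
# Toy IBM Model 1 EM. Initializing from a previous table (what MGIZA does)
# and running EM on only the new data B is NOT equivalent to batch EM on
# A+B: the expected counts gathered on A are gone, so the result only
# reflects (and only covers) the new data.
from collections import defaultdict

def model1_em(corpus, t=None, iterations=5):
    """EM for IBM Model 1 on (source_words, target_words) pairs.
    `t` is an optional starting translation table t[f][e]."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    e_vocab = {e for _, es in corpus for e in es}
    # copy the starting table so the caller's table is not mutated
    t = {f: dict(probs) for f, probs in (t or {}).items()}
    for f in f_vocab:                     # uniform init for unseen words
        if f not in t:
            t[f] = {e: 1.0 / len(e_vocab) for e in e_vocab}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))  # expected counts
        total = defaultdict(float)
        for fs, es in corpus:             # E step
            for f in fs:
                z = sum(t[f].get(e, 1e-12) for e in es)
                for e in es:
                    c = t[f].get(e, 1e-12) / z
                    count[f][e] += c
                    total[e] += c
        # M step: renormalize the expected counts
        t = {f: {e: count[f][e] / total[e] for e in count[f]} for f in count}
    return t

corpus_a = [(["das", "haus"], ["the", "house"]),
            (["das", "buch"], ["the", "book"])]
corpus_b = [(["ein", "buch"], ["a", "book"])]

t_a = model1_em(corpus_a)                 # train on A
t_incr = model1_em(corpus_b, t=t_a)       # "incremental": init from A, EM on B only
t_batch = model1_em(corpus_a + corpus_b)  # batch: EM on A+B

# The incremental table has forgotten A: it only covers words seen in B.
print(sorted(t_incr))   # ['buch', 'ein']
print(sorted(t_batch))  # ['buch', 'das', 'ein', 'haus']
```

True incremental training would have to keep `count`/`total` from the A run and fold B's expected counts into them, which is exactly the "sufficient statistics" bookkeeping the top reply says MGIZA does not do.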
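[Editorial note: Q's question above mentions stepwise/online EM in the sense of Liang and Klein (2009). For readers unfamiliar with it, here is a hedged toy sketch of the stepwise update on a two-component binomial ("biased coins") mixture, chosen only because it keeps the sufficient statistics small; it is not a claim about what inc-giza-pp implements, and the function name `stepwise_em_coins` and the synthetic data are invented for illustration.]

```python
# Stepwise (online) EM: after every observation, interpolate the running
# sufficient statistics toward that observation's expected statistics
# with a decaying stepsize eta_k = (k + 2) ** -alpha, instead of
# recomputing the statistics over the whole data set each iteration.
import random

def stepwise_em_coins(data, n_flips, alpha=0.7, passes=3, seed=0):
    """Two-component binomial mixture; `data` holds head counts per draw."""
    rng = random.Random(seed)
    # per-component running stats: [sum of resp, sum of resp * heads],
    # initialized to mildly separated coins so the labels stay anchored
    s = [[0.5, 0.5 * n_flips * 0.3], [0.5, 0.5 * n_flips * 0.7]]
    k = 0
    for _ in range(passes):
        order = data[:]
        rng.shuffle(order)
        for heads in order:
            pi = [s[0][0], s[1][0]]                      # mixing weights
            theta = [s[j][1] / (s[j][0] * n_flips) for j in (0, 1)]
            # E step on this single observation
            lik = [pi[j] * theta[j] ** heads
                   * (1 - theta[j]) ** (n_flips - heads) for j in (0, 1)]
            z = sum(lik)
            resp = [l / z for l in lik]
            # stepwise interpolation of the sufficient statistics
            eta = (k + 2) ** -alpha
            for j in (0, 1):
                s[j][0] = (1 - eta) * s[j][0] + eta * resp[j]
                s[j][1] = (1 - eta) * s[j][1] + eta * resp[j] * heads
            k += 1
    return [s[j][1] / (s[j][0] * n_flips) for j in (0, 1)]

# synthetic data: 50 draws each from coins with P(heads) = 0.2 and 0.8
gen = random.Random(1)
data = [sum(gen.random() < p for _ in range(10))
        for p in [0.2] * 50 + [0.8] * 50]
est = stepwise_em_coins(data, n_flips=10)
print(est)  # the two estimates should land near 0.2 and 0.8
```

The contrast with the other option Q mentions (storing batch sufficient statistics) is that stepwise EM never revisits old data exactly: the old statistics decay geometrically under the stepsize schedule, which is what makes it suitable for adding data incrementally.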
