it is this:

Abby Levenberg, Chris Callison-Burch and Miles Osborne. Stream-based
Translation Models for Statistical Machine Translation. NAACL, Los
Angeles, USA, 2010.
http://homepages.inf.ed.ac.uk/miles/papers/naacl10b.pdf
Miles

On 15 June 2011 19:28, Qin Gao <[email protected]> wrote:
> Yes, MGIZA isn't really "incrementally training"; it only initializes the
> model parameters with those trained previously, since it does not store
> the sufficient statistics of the previous training. It will give bad
> performance if
>
> 1. you train only Model 1, or
> 2. the incremental data or subset is really small.
>
> It is more suitable for the following scenario: you have trained a model
> on corpus A, you have new data B, and you want to train several
> iterations of Model 4 on A+B.
>
> For the incremental-training GIZA, do you know whether it uses online EM
> (as in Liang and Klein 2009) or just stores the sufficient statistics of
> the previous training?
>
> --Q
>
> On Wed, Jun 15, 2011 at 11:07 AM, Miles Osborne <[email protected]> wrote:
>
>> that isn't the expected answer here. i think the OP wants some kind of
>> incremental (re)training.
>>
>> firstly: it is not really possible to guarantee that performance is not
>> degraded when running from subsets up to the full set (compared with
>> just running on the full set).
>>
>> secondly, you may wish to investigate a version of GIZA which supports
>> incremental retraining. this would allow you to train on a subset and
>> then add more and more data, without retraining from scratch at each
>> point. the current version has minimal documentation, but right now this
>> is hopefully being fixed. if you are feeling brave, look here:
>>
>> http://code.google.com/p/inc-giza-pp/
>>
>> Miles
>>
>> On 15 June 2011 18:50, Kenneth Heafield <[email protected]> wrote:
>>
>>> Try using MGIZA: http://geek.kyloo.net/software/doku.php/mgiza:overview
>>>
>>> On 06/15/11 04:51, Prasanth K wrote:
>>>> Hello All,
>>>>
>>>> I am conducting a series of experiments to build translation systems
>>>> using Moses, in which the corpus of the current experiment is a subset
>>>> of the corpora used in the previous experiment. I have started with
>>>> the Europarl corpora and am likely to repeat this process about 20
>>>> times. Unless I am mistaken, this is going to take me nearly a month,
>>>> and I am looking for ways to speed up the whole process.
>>>>
>>>> Is there an optimal way to run GIZA++ on these different subsets of
>>>> the data without having to run it again and again? I do not want to
>>>> reuse the alignments obtained from running GIZA++ on the entire
>>>> Europarl corpora for the other experiments (by selecting the alignment
>>>> information from aligned.grow-diag-final-and for the sentences in the
>>>> subsets).
>>>>
>>>> The order of the experiments does not matter, so the experiments can
>>>> be done on the smallest dataset followed by supersets of the previous
>>>> dataset, provided there is a way to update the translation
>>>> probabilities from GIZA++ using just the additional data alone, and
>>>> provided this does not hurt performance compared to running GIZA++ on
>>>> the corpus in stand-alone mode.
>>>>
>>>> Kindly let me know if there is some way to do this that I am missing.
>>>>
>>>> - regards,
>>>> Prasanth
>>>>
>>>> --
>>>> "Theories have four stages of acceptance. i) this is worthless
>>>> nonsense; ii) this is an interesting, but perverse, point of view;
>>>> iii) this is true, but quite unimportant; iv) I always said so."
>>>>
>>>> --- J.B.S. Haldane
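To see concretely what the "warm start" Qin Gao describes does and does
not carry over, here is a minimal sketch, assuming IBM Model 1 only
(GIZA++ and MGIZA train a whole sequence of models, and nothing below --
the name model1_em, the data layout -- is their actual interface). The
translation table from an earlier run seeds EM on the combined corpus, so
fewer iterations are needed, but no sufficient statistics from the
earlier run survive.

    from collections import defaultdict

    def model1_em(bitext, t=None, iterations=5, floor=1e-12):
        # One run of IBM Model 1 EM over bitext, a list of
        # (foreign_tokens, english_tokens) pairs.  If t -- a table
        # mapping (f, e) -> probability returned by an earlier call --
        # is supplied, EM starts from those parameters instead of
        # uniform ones: a warm start, with no expected counts kept.
        if t is None:
            t = defaultdict(lambda: 1.0)   # uniform up to normalization
        for _ in range(iterations):
            count = defaultdict(float)     # E-step: expected counts c(f, e)
            total = defaultdict(float)     # marginals for normalization
            for f_sent, e_sent in bitext:
                for f in f_sent:
                    z = sum(t[(f, e)] for e in e_sent)
                    for e in e_sent:
                        p = t[(f, e)] / z
                        count[(f, e)] += p
                        total[e] += p
            # M-step: re-estimate t(f | e) from the expected counts
            t = defaultdict(lambda: floor,
                            {fe: c / total[fe[1]] for fe, c in count.items()})
        return t

A hypothetical rerun on A+B would then look like:

    t_A  = model1_em(corpus_A)                  # original training run
    t_AB = model1_em(corpus_A + corpus_B,
                     t=t_A, iterations=2)       # warm-started rerun

Note that the warm-started rerun still iterates over all of A+B; the warm
start only buys fewer iterations. That matches Qin Gao's caveats: with
only Model 1, or with a tiny B, the saving is small and the result can be
worse than training from scratch.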
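For contrast, here is a sketch of the stepwise online EM that Qin Gao
asks about (Liang and Klein 2009), again cut down to IBM Model 1. Whether
inc-giza-pp actually works this way is exactly the open question in the
thread, so treat everything here -- the mini-batch scheme, the stepsize
(k + 2)^-alpha, the function name -- as an assumption, not a description
of that code. The point is that the running sufficient statistics are
kept, so new data can be folded in without revisiting the old corpus.

    from collections import defaultdict

    def model1_stepwise_em(stream, t, counts, totals, k=0,
                           alpha=0.7, batch_size=1000):
        # Stepwise online EM in the style of Liang and Klein (2009),
        # sketched for IBM Model 1.  counts and totals are the running
        # sufficient statistics over all data seen so far; k is how many
        # mini-batches they already summarize.  New sentence pairs from
        # stream are folded in with a decaying stepsize, so earlier data
        # never has to be revisited.
        batch = []
        for pair in stream:
            batch.append(pair)
            if len(batch) < batch_size:
                continue
            bc = defaultdict(float)        # E-step over this batch only
            bt = defaultdict(float)
            for f_sent, e_sent in batch:
                for f in f_sent:
                    z = sum(t.get((f, e), 1e-12) for e in e_sent)
                    for e in e_sent:
                        p = t.get((f, e), 1e-12) / z
                        bc[(f, e)] += p
                        bt[e] += p
            # Interpolate the running statistics toward the batch ones
            eta = (k + 2) ** (-alpha)      # stepsize, alpha in (0.5, 1]
            for d in (counts, totals):
                for key in d:
                    d[key] *= 1.0 - eta
            for fe, v in bc.items():
                counts[fe] += eta * v
            for e, v in bt.items():
                totals[e] += eta * v
            # M-step from the interpolated sufficient statistics
            t = {fe: c / totals[fe[1]] for fe, c in counts.items()}
            k += 1
            batch = []                     # trailing partial batch dropped
        return t, counts, totals, k

Both counts and totals should be defaultdict(float); starting empty and
feeding corpus A, then later corpus B, never touches A again:

    from collections import defaultdict
    t, counts, totals, k = model1_stepwise_em(
        corpus_A, t={}, counts=defaultdict(float),
        totals=defaultdict(float))
    t, counts, totals, k = model1_stepwise_em(
        corpus_B, t, counts, totals, k)    # B only; A is not re-read

which is the property the original poster is after.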
