> Do you think this is possible? Would Giza++ require massive
> modifications to be able to align these kind of tokens? My gut feeling
> was that a n-gram with a gap in (a template) is to all intents and
> purposes just the same as an n-gram and so the algorithm should
> perform with similar accuracy.

Giza operates in a word-by-word fashion. So when you see multiple words aligning to the same thing, that is, as far as the model is concerned, only accidental. Extending alignment models to deal with n-grams, or n-grams with gaps, makes them considerably more difficult to estimate, and makes Giza a poor starting point for such attempts.

There has, however, been a variety of work in this area. For a starting point, you can look at:

Daniel Marcu and William Wong. 2002. A Phrase-Based, Joint Probability Model for Statistical Machine Translation. In Proc. EMNLP.

Lexi Birch-Mayne. Scalable Phrase-Based, Joint Probability Model for Statistical Machine Translation.

John DeNero, Alexandre Bouchard-Cote and Dan Klein. 2008. Sampling Alignment Structure under a Bayesian Translation Model. In Proc. EMNLP.

>
> Any thoughts?
>
> James
>
> Quoting Philipp Koehn <[email protected]>:
>
>> Hi,
>>
>> is the idea to replace certain parts of the text with tokens such as DATE
>> and then align the rest of the sentence? I'd suggest to just reformat the
>> training data, make sure that matching tokens are added to each sentence
>> pair, and for good measure add 1000 sentences pairs that only contain
>> DATE for input and output language.
>>
>> -phi
>>
>> On Thu, Feb 26, 2009 at 1:38 AM, James Read <[email protected]> wrote:
>>> Consider the following sentence pair.
>>>
>>> I declare resumed the session of the European Parliament adjourned on Friday
>>> 17 December 1999
>>>
>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode
>>> des Europäischen Parlaments für wiederaufgenommen
>>>
>>> This sentence can be reduced to the following templates:
>>>
>>> I declare resumed the session of the European Parliament adjourned on ___
>>>
>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>> Parlaments für wiederaufgenommen
>>>
>>> Given a set of candidate tokens for such template could the current
>>> implementation of Giza++ figure out which template pairs align or do you
>>> think the code would need serious modifications?
>>>
>>> I hope this made my question clearer.
>>>
>>>
>>> Quoting Philipp Koehn <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> not sure, what you are asking for - are you looking for phrasal
>>>> alignments, in other words frequent occurrences of the example
>>>> you mention? This is done by the phrase extraction scripts.
>>>>
>>>> -phi
>>>>
>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> thanks to everybody for responses to my query about parallelising
>>>>> Giza++. All the responses were very useful and have helped the project
>>>>> make quick progress.
>>>>>
>>>>> The greater intention is to use Giza++ to automatically find template
>>>>> translation pairs
>>>>>
>>>>> e.g.
>>>>>
>>>>> English - My name is x
>>>>> Italian - Mi chiamo x
>>>>>
>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>> current state to learning such pairs? Would it be a simple case of
>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>
>>>>> Thanks in advance for any suggestions.
>>>>>
>>>>> James
>>>>>
>>>>> --
>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland, with registration number SC005336.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> [email protected]
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
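
For what it's worth, the preprocessing suggested earlier in the thread (replace dates with a DATE token on both sides, then add some DATE-only pairs) could be sketched roughly as follows. This is only an illustration: the two regular expressions, the function names mask_pair and build_corpus, and the choice to leave a pair unmasked when the two sides contain different numbers of dates are all assumptions made for this example, not anything Giza++ or Moses provides.

```python
import re

# Hypothetical, deliberately narrow date patterns (assumption): real data
# would need a proper date recogniser for each language.
EN_DATE = re.compile(
    r"(?:(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday) )?"
    r"\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)(?: \d{4})?"
)
DE_DATE = re.compile(
    r"(?:(?:Montag|Dienstag|Mittwoch|Donnerstag|Freitag|Samstag|Sonntag), dem )?"
    r"\d{1,2}\. (?:Januar|Februar|März|April|Mai|Juni|Juli|August|"
    r"September|Oktober|November|Dezember)(?: \d{4})?"
)


def mask_pair(en, de, token="DATE"):
    """Replace dates on both sides of a sentence pair with `token`.

    If the two sides do not contain the same number of dates, the pair is
    returned unchanged so that placeholder tokens always match up
    (an assumption of this sketch).
    """
    en_masked, n_en = EN_DATE.subn(token, en)
    de_masked, n_de = DE_DATE.subn(token, de)
    if n_en != n_de:
        return en, de
    return en_masked, de_masked


def build_corpus(pairs, token="DATE", extra=1000):
    """Mask every pair, then append `extra` pairs holding only the token."""
    corpus = [mask_pair(en, de, token) for en, de in pairs]
    corpus.extend([(token, token)] * extra)
    return corpus


if __name__ == "__main__":
    en = ("I declare resumed the session of the European Parliament "
          "adjourned on Friday 17 December 1999")
    de = ("Ich erkläre die am Freitag, dem 17. Dezember unterbrochene "
          "Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen")
    for en_line, de_line in build_corpus([(en, de)], extra=2):
        print(en_line, "|||", de_line)
```

The masked corpus would then be written out as the usual pair of parallel files and aligned word by word like any other training data; the DATE-only pairs are there to push the DATE-to-DATE translation probability up.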
