no: Giza aligns *words*, not *phrases*. phrasal alignment is done as a post-process, by the phrase extraction step that runs over Giza's word alignments.
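
to make the post-process concrete: phrase pairs are usually read off the word alignment by keeping only those spans that are consistent with it, i.e. no word inside the pair is linked to a word outside it. Below is a minimal sketch of that consistency check. It is not the actual Moses extraction code, and it skips the usual extension over unaligned boundary words. The function name `extract_phrase_pairs`, the `max_len` limit and the hand-made links for the "My name is x / Mi chiamo x" pair from this thread are assumptions for illustration only.

```python
from typing import List, Set, Tuple


def extract_phrase_pairs(
    src: List[str],
    tgt: List[str],
    alignment: Set[Tuple[int, int]],
    max_len: int = 4,
) -> List[Tuple[str, str]]:
    """Return phrase pairs that are consistent with a word alignment.

    `alignment` is a set of (source_index, target_index) links, as a
    word aligner would produce after symmetrisation (assumed input).
    """
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency: no target word in the span may be linked to a
            # source word outside the source span.
            inconsistent = any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in alignment
            )
            if inconsistent:
                continue
            pairs.append(
                (" ".join(src[s_start:s_end + 1]), " ".join(tgt[t_start:t_end + 1]))
            )
    return pairs


# Hypothetical word alignment for the example pair from the thread:
# my-mi, name-chiamo, is-chiamo, x-x.
example_src = "my name is x".split()
example_tgt = "mi chiamo x".split()
example_links = {(0, 0), (1, 1), (2, 1), (3, 2)}

for pair in extract_phrase_pairs(example_src, example_tgt, example_links):
    print(pair)
```

note that "name" on its own is never paired with "chiamo" here, because "chiamo" is also linked to "is". The multi-word pairs fall out of the word links rather than being aligned as units.
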
the links Chris mentioned attempt to also deal with phrases, with varying success.

Miles

2009/2/27 James Read <[email protected]>:
> I was under the impression that Giza++ already aligns phrases (i.e. n-grams).
>
> Quoting Chris Dyer <[email protected]>:
>
>>> Do you think this is possible? Would Giza++ require massive
>>> modifications to be able to align these kind of tokens? My gut feeling
>>> was that a n-gram with a gap in (a template) is to all intents and
>>> purposes just the same as an n-gram and so the algorithm should
>>> perform with similar accuracy.
>> Giza operates in a word-by-word fashion. So, when you see multiple
>> words aligning to the same thing, as far as the model is concerned,
>> only accidental. Extending alignment models to deal with n-grams or
>> n-grams with gaps makes them considerably more difficult to estimate,
>> and makes Giza a poor starting point for such attempts. But, there
>> has been a variety of work in this area though. For a starting point,
>> you can look at:
>>
>> Daniel Marcu & William Wong. (2002) A phrase-based, joint probability
>> model for statistical machine translation. In Proceedings of EMNLP
>>
>> Lexi Birch-Mayne. Scalable Phrase-Based, Joint Probability Model for
>> Statistical Machine Translation.
>>
>> John DeNero, Alexandre Bouchard-Cote and Dan Klein. 2008. Sampling
>> Alignment Structure under a Bayesian Translation Model. In Proc. EMNLP
>>
>>>
>>> Any thoughts?
>>>
>>> James
>>>
>>> Quoting Philipp Koehn <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> is the idea to replace certain parts of the text with tokens such as DATE
>>>> and then align the rest of the sentence? I'd suggest to just reformat the
>>>> training data, make sure that matching tokens are added to each sentence
>>>> pair, and for good measure add 1000 sentences pairs that only contain
>>>> DATE for input and output language.
>>>>
>>>> -phi
>>>>
>>>> On Thu, Feb 26, 2009 at 1:38 AM, James Read <[email protected]> wrote:
>>>>> Consider the following sentence pair.
>>>>>
>>>>> I declare resumed the session of the European Parliament
>>>>> adjourned on Friday 17 December 1999
>>>>>
>>>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode
>>>>> des Europäischen Parlaments für wiederaufgenommen
>>>>>
>>>>> This sentence can be reduced to the following templates:
>>>>>
>>>>> I declare resumed the session of the European Parliament adjourned on ___
>>>>>
>>>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>>>> Parlaments für wiederaufgenommen
>>>>>
>>>>> Given a set of candidate tokens for such template could the current
>>>>> implementation of Giza++ figure out which template pairs align or do you
>>>>> think the code would need serious modifications?
>>>>>
>>>>> I hope this made my question clearer.
>>>>>
>>>>> Quoting Philipp Koehn <[email protected]>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> not sure, what you are asking for - are you looking for phrasal
>>>>>> alignments, in other words frequent occurrences of the example
>>>>>> you mention? This is done by the phrase extraction scripts.
>>>>>>
>>>>>> -phi
>>>>>>
>>>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> thanks to everybody for responses to my query about parallelising
>>>>>>> Giza++. All the responses were very useful and have helped the project
>>>>>>> make quick progress.
>>>>>>>
>>>>>>> The greater intention is to use Giza++ to automatically find template
>>>>>>> translation pairs
>>>>>>>
>>>>>>> e.g.
>>>>>>>
>>>>>>> English - My name is x
>>>>>>> Italian - Mi chiamo x
>>>>>>>
>>>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>>>> current state to learning such pairs? Would it be a simple case of
>>>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>>>
>>>>>>> Thanks in advance for any suggestions.
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> --
>>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>>> Scotland, with registration number SC005336.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> [email protected]
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
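
For anyone wanting to try Philipp's DATE suggestion from the thread above, here is a rough sketch of the data reformatting step: dates on both sides are replaced with a shared DATE token, a pair is only rewritten when both sides contain the same number of dates, and 1000 DATE-only pairs are produced as padding. The regular expressions cover just the English and German date shapes seen in the example sentence, and the names `mask_dates` and `padding_pairs` are made up for this illustration; a real corpus would need broader patterns.

```python
import re

# Deliberately narrow patterns (an assumption for this sketch): they match
# dates like "Friday 17 December 1999" and "Freitag, dem 17. Dezember".
EN_MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
             r"September|October|November|December)")
EN_DAYS = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
DE_MONTHS = (r"(?:Januar|Februar|März|April|Mai|Juni|Juli|August|"
             r"September|Oktober|November|Dezember)")
DE_DAYS = r"(?:Montag|Dienstag|Mittwoch|Donnerstag|Freitag|Samstag|Sonntag)"

EN_DATE = re.compile(rf"(?:{EN_DAYS}\s+)?\d{{1,2}}\s+{EN_MONTHS}(?:\s+\d{{4}})?")
DE_DATE = re.compile(rf"(?:{DE_DAYS},\s+dem\s+)?\d{{1,2}}\.\s+{DE_MONTHS}(?:\s+\d{{4}})?")


def mask_dates(en: str, de: str, token: str = "DATE"):
    """Replace dates with `token` on both sides of a sentence pair.

    The pair is returned unchanged when the two sides do not contain the
    same number of dates, so that the tokens always match up.
    """
    en_masked, n_en = EN_DATE.subn(token, en)
    de_masked, n_de = DE_DATE.subn(token, de)
    if n_en != n_de:
        return en, de
    return en_masked, de_masked


def padding_pairs(n: int = 1000, token: str = "DATE"):
    """Extra sentence pairs that contain only the token on both sides."""
    return [(token, token)] * n


en = ("I declare resumed the session of the European Parliament "
      "adjourned on Friday 17 December 1999")
de = ("Ich erkläre die am Freitag, dem 17. Dezember unterbrochene "
      "Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen")

print(mask_dates(en, de))
print(len(padding_pairs()))
```

The masked pairs plus the padding pairs would then go through the normal word alignment step, with DATE treated as an ordinary token.
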
