I was under the impression that Giza++ already aligns phrases (i.e. n-grams).
Quoting Chris Dyer <[email protected]>:

>> Do you think this is possible? Would Giza++ require massive
>> modifications to be able to align these kinds of tokens? My gut feeling
>> was that an n-gram with a gap in it (a template) is to all intents and
>> purposes just the same as an n-gram and so the algorithm should
>> perform with similar accuracy.
>
> Giza operates in a word-by-word fashion. So, when you see multiple
> words aligning to the same thing, that is, as far as the model is
> concerned, only accidental. Extending alignment models to deal with
> n-grams or n-grams with gaps makes them considerably more difficult to
> estimate, and makes Giza a poor starting point for such attempts. But
> there has been a variety of work in this area. For a starting point,
> you can look at:
>
> Daniel Marcu & William Wong. (2002) A phrase-based, joint probability
> model for statistical machine translation. In Proceedings of EMNLP
>
> Lexi Birch-Mayne. Scalable Phrase-Based, Joint Probability Model for
> Statistical Machine Translation.
>
> John DeNero, Alexandre Bouchard-Cote and Dan Klein. 2008. Sampling
> Alignment Structure under a Bayesian Translation Model. In Proc. EMNLP
>
>>
>> Any thoughts?
>>
>> James
>>
>> Quoting Philipp Koehn <[email protected]>:
>>
>>> Hi,
>>>
>>> is the idea to replace certain parts of the text with tokens such as DATE
>>> and then align the rest of the sentence? I'd suggest just reformatting the
>>> training data, making sure that matching tokens are added to each sentence
>>> pair, and for good measure adding 1000 sentence pairs that only contain
>>> DATE for input and output language.
>>>
>>> -phi
>>>
>>> On Thu, Feb 26, 2009 at 1:38 AM, James Read <[email protected]> wrote:
>>>> Consider the following sentence pair.
>>>>
>>>> I declare resumed the session of the European Parliament
>>>> adjourned on Friday 17 December 1999
>>>>
>>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene
>>>> Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen
>>>>
>>>> This sentence pair can be reduced to the following templates:
>>>>
>>>> I declare resumed the session of the European Parliament adjourned on ___
>>>>
>>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>>> Parlaments für wiederaufgenommen
>>>>
>>>> Given a set of candidate tokens for such templates, could the current
>>>> implementation of Giza++ figure out which template pairs align, or do you
>>>> think the code would need serious modifications?
>>>>
>>>> I hope this made my question clearer.
>>>>
>>>>
>>>> Quoting Philipp Koehn <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> not sure what you are asking for - are you looking for phrasal
>>>>> alignments, in other words frequent occurrences of the example
>>>>> you mention? This is done by the phrase extraction scripts.
>>>>>
>>>>> -phi
>>>>>
>>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> thanks to everybody for the responses to my query about parallelising
>>>>>> Giza++. All the responses were very useful and have helped the project
>>>>>> make quick progress.
>>>>>>
>>>>>> The greater intention is to use Giza++ to automatically find template
>>>>>> translation pairs
>>>>>>
>>>>>> e.g.
>>>>>>
>>>>>> English - My name is x
>>>>>> Italian - Mi chiamo x
>>>>>>
>>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>>> current state to learning such pairs? Would it be a simple case of
>>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>>
>>>>>> Thanks in advance for any suggestions.
>>>>>>
>>>>>> James
>>>>>>
>>>>>> --
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>> Scotland, with registration number SC005336.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
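
For what it's worth, the reformatting Philipp suggests in the quoted thread (replace dates with a DATE token on both sides, keep the tokens matched per sentence pair, and append some DATE-only pairs) could be sketched roughly as below. This is only an illustration and not part of Giza++ or Moses: the regular expressions, the `PLACEHOLDER` constant and the `prepare_pairs` helper are hypothetical names made up here, and the date patterns are deliberately narrow (just enough for the Europarl example in the thread).

```python
import re

PLACEHOLDER = "DATE"  # token name taken from the thread

# Assumed, deliberately narrow date vocabularies; a real corpus needs more.
MONTHS_EN = ("January|February|March|April|May|June|July|August|"
             "September|October|November|December")
MONTHS_DE = ("Januar|Februar|März|April|Mai|Juni|Juli|August|"
             "September|Oktober|November|Dezember")
DAYS_EN = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
DAYS_DE = "Montag|Dienstag|Mittwoch|Donnerstag|Freitag|Samstag|Sonntag"

# e.g. "Friday 17 December 1999" (weekday and year optional)
EN_DATE = re.compile(
    rf"\b(?:(?:{DAYS_EN})\s+)?\d{{1,2}}\s+(?:{MONTHS_EN})(?:\s+\d{{4}})?\b"
)
# e.g. "Freitag, dem 17. Dezember" (weekday, "dem" and year optional)
DE_DATE = re.compile(
    rf"\b(?:(?:{DAYS_DE}),?\s+(?:dem\s+)?)?\d{{1,2}}\.\s+(?:{MONTHS_DE})"
    rf"(?:\s+\d{{4}})?\b"
)


def prepare_pairs(pairs, extra_placeholder_pairs=1000):
    """Mask dates in (English, German) pairs before word alignment.

    A pair is masked only when both sides yield the same number of
    placeholders, so the tokens always match; otherwise it is left as-is.
    DATE-only pairs are appended at the end, as suggested in the thread.
    """
    prepared = []
    for english, german in pairs:
        masked_en, count_en = EN_DATE.subn(PLACEHOLDER, english)
        masked_de, count_de = DE_DATE.subn(PLACEHOLDER, german)
        if count_en == count_de:
            prepared.append((masked_en, masked_de))
        else:
            prepared.append((english, german))
    prepared.extend([(PLACEHOLDER, PLACEHOLDER)] * extra_placeholder_pairs)
    return prepared
```

The output pairs would then be written back out as the two sides of the training corpus in the usual way. The word aligner still only sees DATE as one more word; the masking just makes that word frequent and consistent on both sides.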
