Ok. Thanks. Can you point me to anywhere that mentions how phrase alignment is done as a post process? Maybe this is an avenue I should also be considering.
Thanks

Quoting Miles Osborne <[email protected]>:

> no: Giza aligns *words*, not *phrases*. phrasal alignment is done as
> a post-process.
>
> the links Chris mentioned attempt to also deal with phrases, with
> varying success.
>
> Miles
>
> 2009/2/27 James Read <[email protected]>:
>> I was under the impression that Giza++ already aligns phrases (i.e.
>> n-grams).
>>
>> Quoting Chris Dyer <[email protected]>:
>>
>>>> Do you think this is possible? Would Giza++ require massive
>>>> modifications to be able to align these kind of tokens? My gut feeling
>>>> was that a n-gram with a gap in (a template) is to all intents and
>>>> purposes just the same as an n-gram and so the algorithm should
>>>> perform with similar accuracy.
>>> Giza operates in a word-by-word fashion. So, when you see multiple
>>> words aligning to the same thing, as far as the model is concerned,
>>> only accidental. Extending alignment models to deal with n-grams or
>>> n-grams with gaps makes them considerably more difficult to estimate,
>>> and makes Giza a poor starting point for such attempts. But, there
>>> has been a variety of work in this area though. For a starting point,
>>> you can look at:
>>>
>>> Daniel Marcu & William Wong. (2002) A phrase-based, joint probability
>>> model for statistical machine translation. In Proceedings of EMNLP
>>>
>>> Lexi Birch-Mayne. Scalable Phrase-Based, Joint Probability Model for
>>> Statistical Machine Translation.
>>>
>>> John DeNero, Alexandre Bouchard-Cote and Dan Klein. 2008. Sampling
>>> Alignment Structure under a Bayesian Translation Model. In Proc. EMNLP
>>>
>>>>
>>>> Any thoughts?
>>>>
>>>> James
>>>>
>>>> Quoting Philipp Koehn <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> is the idea to replace certain parts of the text with tokens such as DATE
>>>>> and then align the rest of the sentence? I'd suggest to just reformat the
>>>>> training data, make sure that matching tokens are added to each sentence
>>>>> pair, and for good measure add 1000 sentences pairs that only contain
>>>>> DATE for input and output language.
>>>>>
>>>>> -phi
>>>>>
>>>>> On Thu, Feb 26, 2009 at 1:38 AM, James Read
>>>>> <[email protected]> wrote:
>>>>>> Consider the following sentence pair.
>>>>>>
>>>>>> I declare resumed the session of the European Parliament
>>>>>> adjourned on Friday
>>>>>> 17 December 1999
>>>>>>
>>>>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene
>>>>>> Sitzungsperiode
>>>>>> des Europäischen Parlaments für wiederaufgenommen
>>>>>>
>>>>>> This sentence can be reduced to the following templates:
>>>>>>
>>>>>> I declare resumed the session of the European Parliament
>>>>>> adjourned on ___
>>>>>>
>>>>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>>>>> Parlaments für wiederaufgenommen
>>>>>>
>>>>>> Given a set of candidate tokens for such template could the current
>>>>>> implementation of Giza++ figure out which template pairs align or do you
>>>>>> think the code would need serious modifications?
>>>>>>
>>>>>> I hope this made my question clearer.
>>>>>>
>>>>>>
>>>>>> Quoting Philipp Koehn <[email protected]>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> not sure, what you are asking for - are you looking for phrasal
>>>>>>> alignments, in other words frequent occurrences of the example
>>>>>>> you mention? This is done by the phrase extraction scripts.
>>>>>>>
>>>>>>> -phi
>>>>>>>
>>>>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> thanks to everybody for responses to my query about parallelising
>>>>>>>> Giza++. All the responses were very useful and have helped the project
>>>>>>>> make quick progress.
>>>>>>>>
>>>>>>>> The greater intention is to use Giza++ to automatically find template
>>>>>>>> translation pairs
>>>>>>>>
>>>>>>>> e.g.
>>>>>>>>
>>>>>>>> English - My name is x
>>>>>>>> Italian - Mi chiamo x
>>>>>>>>
>>>>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>>>>> current state to learning such pairs? Would it be a simple case of
>>>>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>>>>
>>>>>>>> Thanks in advance for any suggestions.
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>>>>>>>> --
>>>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>>>> Scotland, with registration number SC005336.
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Moses-support mailing list
>>>>>>>> [email protected]
>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
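
For reference, the post-process Miles describes (going from word alignments to phrase pairs) can be sketched roughly as below. This is only a minimal illustration of the usual "consistent phrase pair" idea, not the actual Moses phrase extraction script: the function name `extract_phrases`, the `max_len` parameter and the alignment links in the example are all made up for illustration, and the sketch does not extend phrases over unaligned boundary words.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Collect phrase pairs that are consistent with a word alignment.

    src, tgt  : lists of tokens
    alignment : set of (src_index, tgt_index) links (hypothetical input,
                e.g. what a word aligner might produce)
    max_len   : maximum phrase length on either side (illustrative limit)
    """
    pairs = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word inside the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no word inside the target span may be
            # linked to a source word outside the source span.
            inconsistent = any(
                t_start <= t <= t_end and not (s_start <= s <= s_end)
                for (s, t) in alignment
            )
            if inconsistent:
                continue
            pairs.add((
                " ".join(src[s_start:s_end + 1]),
                " ".join(tgt[t_start:t_end + 1]),
            ))
    return pairs


# Example from the thread, with an assumed word alignment:
# My-Mi, name-chiamo, is-chiamo, x-x
src = "My name is x".split()
tgt = "Mi chiamo x".split()
links = {(0, 0), (1, 1), (2, 1), (3, 2)}

for pair in sorted(extract_phrases(src, tgt, links)):
    print(pair)
```

With these assumed links, "name is" / "chiamo" comes out as a phrase pair, while "name" / "chiamo" on its own does not, because "chiamo" is also linked to "is". The phrases only fall out of the word links afterwards; the aligner itself never sees them.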
