Re: [Moses-support] Giza++ input tokens (templates)

Miles Osborne Fri, 27 Feb 2009 00:51:13 -0800

No: Giza aligns *words*, not *phrases*.  Phrasal alignment is done as
a post-process.
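
As a rough illustration of that post-process: the sketch below takes a word
alignment of the kind Giza produces and collects every phrase pair that is
consistent with it, i.e. no word inside the pair is linked to a word outside
it. The name `extract_phrases`, the `max_len` limit and the hand-written
alignment for the "My name is x" example are assumptions for illustration.
This is not the Moses extraction script, and it skips the usual extension
over unaligned words.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Collect phrase pairs consistent with a word alignment.

    alignment is a set of (i, j) pairs meaning src[i] is linked to tgt[j].
    Simplified: target spans are not extended over unaligned words.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any source word in [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency check: no target word inside [j1, j2] may be
            # linked to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Hypothetical word alignment: "name" and "is" both link to "chiamo".
src = "my name is x".split()
tgt = "mi chiamo x".split()
alignment = {(0, 0), (1, 1), (2, 1), (3, 2)}
phrase_pairs = extract_phrases(src, tgt, alignment)
```

With this alignment, "name is" / "chiamo" comes out as a phrase pair, while
"name" / "chiamo" does not, because "chiamo" is also linked to "is".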


The papers Chris mentioned also attempt to deal with phrases, with
varying success.

Miles

2009/2/27 James Read <[email protected]>:
> I was under the impression that Giza++ already aligns phrases (i.e. n-grams).
>
> Quoting Chris Dyer <[email protected]>:
>
>>> Do you think this is possible? Would Giza++ require massive
>>> modifications to be able to align these kinds of tokens? My gut feeling
>>> was that an n-gram with a gap in it (a template) is to all intents and
>>> purposes just the same as an n-gram and so the algorithm should
>>> perform with similar accuracy.
>> Giza operates in a word-by-word fashion.  So, when you see multiple
>> words aligning to the same thing, that is, as far as the model is
>> concerned, only accidental.  Extending alignment models to deal with
>> n-grams or n-grams with gaps makes them considerably more difficult to
>> estimate, and makes Giza a poor starting point for such attempts.
>> There has been a variety of work in this area, though.  For a starting
>> point, you can look at:
>>
>> Daniel Marcu & William Wong. (2002) A phrase-based, joint probability
>> model for statistical machine translation. In Proceedings of EMNLP
>>
>> Lexi Birch-Mayne. Scalable Phrase-Based, Joint Probability Model for
>> Statistical Machine Translation.
>>
>> John DeNero, Alexandre Bouchard-Cote and Dan Klein. 2008. Sampling
>> Alignment Structure under a Bayesian Translation Model. In Proc. EMNLP
>>
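To make the word-by-word point above concrete, here is a minimal sketch of EM
training for IBM Model 1, the simplest of the models Giza++ trains. Each
target word spreads its probability mass over single source words only, so
nothing in the model represents an n-gram or a template. The function name
`ibm_model1`, the toy corpus and the omission of the NULL word are
simplifying assumptions, not Giza++ code.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Estimate word translation probabilities t(f | e) with EM."""
    # Any constant start works: the E-step normalises per sentence.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in pairs:
            for f in fs:
                # E-step: share one unit of count for f among the
                # source words of this sentence, in proportion to t.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise expected counts per source word.
        t = defaultdict(float, {(f, e): c / total[e]
                                for (f, e), c in count.items()})
    return t


# Hypothetical toy corpus: (source words, target words).
toy = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]
t = ibm_model1(toy)
```

After a few iterations `t[("das", "the")]` dominates `t[("haus", "the")]`,
but every parameter is still a probability for one word given one word.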
>>>
>>> Any thoughts?
>>>
>>> James
>>>
>>> Quoting Philipp Koehn <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> Is the idea to replace certain parts of the text with tokens such as DATE
>>>> and then align the rest of the sentence? I'd suggest just reformatting the
>>>> training data, making sure that matching tokens are added to each sentence
>>>> pair, and for good measure adding 1000 sentence pairs that contain only
>>>> DATE for both the input and output language.
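
One way the reformatting suggested above might look, as a sketch only: mask
dates on both sides with a DATE token, keep the masked version only when both
sides contain the same number of DATE tokens, and append the extra DATE-only
pairs. The two regular expressions are deliberately narrow and hypothetical
(they cover little more than the example from this thread), and reading
"matching tokens" as "equal counts on both sides" is an assumption.

```python
import re

EN_DAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
EN_MONTHS = ("January|February|March|April|May|June|July|August|"
             "September|October|November|December")
DE_DAYS = "Montag|Dienstag|Mittwoch|Donnerstag|Freitag|Samstag|Sonntag"
DE_MONTHS = ("Januar|Februar|März|April|Mai|Juni|Juli|August|"
             "September|Oktober|November|Dezember")

# Hypothetical patterns, e.g. "Friday 17 December 1999" and
# "Freitag, dem 17. Dezember"; real data would need far more cases.
EN_DATE = re.compile(
    r"(?:(?:" + EN_DAYS + r"),? )?\d{1,2} (?:" + EN_MONTHS + r")(?: \d{4})?")
DE_DATE = re.compile(
    r"(?:(?:" + DE_DAYS + r"), dem )?\d{1,2}\. (?:" + DE_MONTHS + r")(?: \d{4})?")


def mask_pair(en, de):
    """Replace dates with DATE if both sides get the same number of tokens."""
    en_masked, n_en = EN_DATE.subn("DATE", en)
    de_masked, n_de = DE_DATE.subn("DATE", de)
    if n_en != n_de:
        # Mismatch: leave the pair untouched rather than guess.
        return en, de
    return en_masked, de_masked


def build_corpus(pairs, extra=1000):
    """Mask every pair, then append `extra` DATE-only sentence pairs."""
    return [mask_pair(en, de) for en, de in pairs] + [("DATE", "DATE")] * extra


en = ("I declare resumed the session of the European Parliament "
      "adjourned on Friday 17 December 1999")
de = ("Ich erkläre die am Freitag, dem 17. Dezember unterbrochene "
      "Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen")
corpus = build_corpus([(en, de)])
```

The resulting files can then be given to Giza++ as ordinary tokenised text;
DATE is just another word as far as the aligner is concerned.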
>>>>
>>>> -phi
>>>>
>>>> On Thu, Feb 26, 2009 at 1:38 AM, James Read <[email protected]> wrote:
>>>>> Consider the following sentence pair.
>>>>>
>>>>> I declare resumed the session of the European Parliament adjourned on
>>>>> Friday 17 December 1999
>>>>>
>>>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode
>>>>> des Europäischen Parlaments für wiederaufgenommen
>>>>>
>>>>> This sentence pair can be reduced to the following templates:
>>>>>
>>>>> I declare resumed the session of the European Parliament adjourned on ___
>>>>>
>>>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>>>> Parlaments für wiederaufgenommen
>>>>>
>>>>> Given a set of candidate tokens for such templates, could the current
>>>>> implementation of Giza++ figure out which template pairs align, or do you
>>>>> think the code would need serious modifications?
>>>>>
>>>>> I hope this made my question clearer.
>>>>>
>>>>>
>>>>> Quoting Philipp Koehn <[email protected]>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Not sure what you are asking for. Are you looking for phrasal
>>>>>> alignments, in other words frequent occurrences of the example
>>>>>> you mention? This is done by the phrase extraction scripts.
>>>>>>
>>>>>> -phi
>>>>>>
>>>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> thanks to everybody for responses to my query about parallelising
>>>>>>> Giza++. All the responses were very useful and have helped the project
>>>>>>> make quick progress.
>>>>>>>
>>>>>>> The greater intention is to use Giza++ to automatically find template
>>>>>>> translation pairs
>>>>>>>
>>>>>>> e.g.
>>>>>>>
>>>>>>> English - My name is x
>>>>>>> Italian - Mi chiamo x
>>>>>>>
>>>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>>>> current state to learning such pairs? Would it be a simple case of
>>>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>>>
>>>>>>> Thanks in advance for any suggestions.
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> --
>>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>>> Scotland, with registration number SC005336.
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> [email protected]
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support