Re: [Moses-support] Giza++ input tokens (templates)

James Read Fri, 27 Feb 2009 01:13:35 -0800

Thanks for the off-list link. It was very useful.

Forgive me for my ignorance but what exactly is the problem with using  
Giza++ for n-gram alignment? A single word is just a string of  
letters. An n-gram is a string of letters with some spaces in between.  
Why should using Giza for aligning strings of letters with spaces in  
between be any different to aligning strings of letters? Is this just  
a problem of computation time and limited computational resources?
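
To make the question concrete, here is a minimal sketch of what "presenting n-grams as tokens" could look like as a preprocessing step. It assumes a hand-supplied list of candidate n-grams; the function name glue_ngrams and the underscore joiner are hypothetical choices for illustration and are not part of Giza++.

```python
def glue_ngrams(sentence, candidates, joiner="_"):
    """Replace each candidate n-gram with a single glued token.

    Greedy, left to right, preferring the longest candidate at each position.
    """
    words = sentence.split()
    cand = {tuple(c.split()) for c in candidates}
    max_len = max((len(c) for c in cand), default=1)
    out = []
    i = 0
    while i < len(words):
        # Try the longest possible n-gram first, down to bigrams.
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in cand:
                out.append(joiner.join(words[i:i + n]))
                i += n
                break
        else:
            # No candidate starts here: keep the word as it is.
            out.append(words[i])
            i += 1
    return " ".join(out)


english = glue_ngrams("My name is James", ["My name is"])
italian = glue_ngrams("Mi chiamo James", ["Mi chiamo"])
print(english)
print(italian)
```

A word aligner would then see each glued n-gram as one opaque token. Note that the segmentation has to be fixed before alignment starts, and longer units are generally rarer in the corpus than the words they are made of.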


Quoting Miles Osborne <[email protected]>:

> no:  Giza aligns *words*, not *phrases*.  phrasal alignment is done as
> a post-process.
>
> the links Chris mentioned attempt to also deal with phrases, with
> varying success.
>
> Miles
>
> 2009/2/27 James Read <[email protected]>:
>> I was under the impression that Giza++ already aligns phrases (i.e.
>> n-grams).
>>
>> Quoting Chris Dyer <[email protected]>:
>>
>>>> Do you think this is possible? Would Giza++ require massive
>>>> modifications to be able to align these kinds of tokens? My gut feeling
>>>> was that an n-gram with a gap in it (a template) is to all intents and
>>>> purposes just the same as an n-gram and so the algorithm should
>>>> perform with similar accuracy.
>>> Giza operates in a word-by-word fashion.  So, when you see multiple
>>> words aligning to the same thing, that is, as far as the model is
>>> concerned, only accidental.  Extending alignment models to deal with
>>> n-grams or n-grams with gaps makes them considerably more difficult to
>>> estimate, and makes Giza a poor starting point for such attempts.
>>> There has been a variety of work in this area, though.  For a starting
>>> point, you can look at:
>>>
>>> Daniel Marcu & William Wong. (2002) A phrase-based, joint probability
>>> model for statistical machine translation. In Proceedings of EMNLP
>>>
>>> Lexi Birch-Mayne. Scalable Phrase-Based, Joint Probability Model for
>>> Statistical Machine Translation.
>>>
>>> John DeNero, Alexandre Bouchard-Cote and Dan Klein. 2008. Sampling
>>> Alignment Structure under a Bayesian Translation Model. In Proc. EMNLP
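
To illustrate what "word-by-word" means in practice, here is a minimal sketch of EM training in the style of IBM Model 1, the simplest of the word-based models that Giza++ trains. It is an illustration only, not Giza++ code: the function names and the toy corpus are assumptions, and the NULL word is left out for brevity.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """pairs: list of (foreign_words, english_words).

    Returns a table t[(f, e)] estimating P(f | e) for single words only.
    """
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Uniform initialisation over the foreign vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected counts, one foreign word at a time.
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per English word.
        t = defaultdict(float,
                        {(f, e): count[(f, e)] / total[e] for (f, e) in count})
    return t


def align(fs, es, t):
    """For each foreign word, pick the single most probable English word."""
    return [max(es, key=lambda e: t[(f, e)]) for f in fs]


corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
table = train_model1(corpus)
print(align("das Haus".split(), "the house".split(), table))
```

Every parameter here is indexed by a pair of single words, and each foreign word chooses its English word independently of its neighbours. If two adjacent words end up linked to the same word, nothing in the model ties them together as a unit.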
>>>
>>>>
>>>> Any thoughts?
>>>>
>>>> James
>>>>
>>>> Quoting Philipp Koehn <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> is the idea to replace certain parts of the text with tokens such as DATE
>>>>> and then align the rest of the sentence? I'd suggest just reformatting the
>>>>> training data, making sure that matching tokens are added to each sentence
>>>>> pair, and for good measure adding 1000 sentence pairs that only contain
>>>>> DATE for input and output language.
>>>>>
>>>>> -phi
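
The reformatting suggested above could be sketched roughly as follows. The two date patterns are illustrative assumptions that only cover forms like "17 December 1999" and "17. Dezember"; the function name mask_dates is likewise hypothetical. A real corpus would need broader patterns.

```python
import re

# Illustrative patterns only: day, month name, optional four-digit year.
EN_DATE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)(?: \d{4})?\b")
DE_DATE = re.compile(
    r"\b\d{1,2}\. (?:Januar|Februar|März|April|Mai|Juni|Juli|August|"
    r"September|Oktober|November|Dezember)(?: \d{4})?\b")


def mask_dates(pairs, n_extra=1000, token="DATE"):
    """Replace dates with a placeholder token on both sides of each pair.

    A pair is only rewritten when both sides contain the same non-zero
    number of dates, so that the placeholder tokens match up.
    """
    out = []
    for en, de in pairs:
        en_masked, en_n = EN_DATE.subn(token, en)
        de_masked, de_n = DE_DATE.subn(token, de)
        if en_n == de_n and en_n > 0:
            out.append((en_masked, de_masked))
        else:
            out.append((en, de))
    # Extra pairs that contain only the placeholder on both sides.
    out.extend([(token, token)] * n_extra)
    return out


pairs = [("adjourned on Friday 17 December 1999",
          "die am Freitag, dem 17. Dezember unterbrochene")]
masked = mask_dates(pairs)
print(masked[0])
```

The placeholder-only pairs give the aligner many examples in which DATE on one side co-occurs with nothing but DATE on the other.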
>>>>>
>>>>> On Thu, Feb 26, 2009 at 1:38 AM, James Read <[email protected]> wrote:
>>>>>> Consider the following sentence pair.
>>>>>>
>>>>>> I declare resumed the session of the European Parliament adjourned
>>>>>> on Friday 17 December 1999
>>>>>>
>>>>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene
>>>>>> Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen
>>>>>>
>>>>>> This sentence can be reduced to the following templates:
>>>>>>
>>>>>> I declare resumed the session of the European Parliament adjourned
>>>>>> on ___
>>>>>>
>>>>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>>>>> Parlaments für wiederaufgenommen
>>>>>>
>>>>>> Given a set of candidate tokens for such templates, could the current
>>>>>> implementation of Giza++ figure out which template pairs align, or do you
>>>>>> think the code would need serious modifications?
>>>>>>
>>>>>> I hope this made my question clearer.
>>>>>>
>>>>>>
>>>>>> Quoting Philipp Koehn <[email protected]>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> not sure what you are asking for - are you looking for phrasal
>>>>>>> alignments, in other words frequent occurrences of the example
>>>>>>> you mention? This is done by the phrase extraction scripts.
>>>>>>>
>>>>>>> -phi
>>>>>>>
>>>>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> thanks to everybody for responses to my query about parallelising
>>>>>>>> Giza++. All the responses were very useful and have helped the project
>>>>>>>> make quick progress.
>>>>>>>>
>>>>>>>> The greater intention is to use Giza++ to automatically find template
>>>>>>>> translation pairs
>>>>>>>>
>>>>>>>> e.g.
>>>>>>>>
>>>>>>>> English - My name is x
>>>>>>>> Italian - Mi chiamo x
>>>>>>>>
>>>>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>>>>> current state to learning such pairs? Would it be a simple case of
>>>>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>>>>
>>>>>>>> Thanks in advance for any suggestions.
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>>>>>>>> --
>>>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>>>> Scotland, with registration number SC005336.
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Moses-support mailing list
>>>>>>>> [email protected]
>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support



