Re: [Moses-support] Giza++ input tokens (templates)

Chris Dyer Thu, 26 Feb 2009 11:54:16 -0800

> Do you think this is possible? Would Giza++ require massive
> modifications to be able to align these kinds of tokens? My gut feeling
> was that an n-gram with a gap in it (a template) is to all intents and
> purposes just the same as an n-gram and so the algorithm should
> perform with similar accuracy.
Giza operates in a word-by-word fashion.  So, when you see multiple
words aligning to the same thing, that is, as far as the model is
concerned, only accidental.  Extending alignment models to deal with
n-grams or n-grams with gaps makes them considerably more difficult to
estimate, and makes Giza a poor starting point for such attempts.
There has, however, been a variety of work in this area.  For a
starting point, you can look at:


Daniel Marcu and William Wong. 2002. A Phrase-Based, Joint Probability
Model for Statistical Machine Translation. In Proc. EMNLP

Lexi Birch-Mayne. Scalable Phrase-Based, Joint Probability Model for
Statistical Machine Translation.

John DeNero, Alexandre Bouchard-Cote and Dan Klein. 2008. Sampling
Alignment Structure under a Bayesian Translation Model. In Proc. EMNLP
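
To make "word-by-word" concrete, here is a minimal sketch of EM training
for IBM Model 1, the simplest of the word-based models that Giza++
trains. It is an illustration only, not Giza++'s code: the function
name, the toy corpus, and the omission of the NULL word are all
assumptions made for brevity. The thing to notice is that the only
parameters are t(f|e) for single word pairs, so there is nowhere for an
n-gram or a gapped template to live as a unit.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(f | e) from (e_tokens, f_tokens) pairs.

    Hypothetical helper for illustration; the NULL word is omitted.
    """
    f_vocab = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked
        # E-step: spread each f over the words e of its own sentence
        for es, fs in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise per single word e
        t = defaultdict(float)
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Toy corpus (an assumption, not data from this thread)
toy_corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
t = train_ibm_model1(toy_corpus)
```

Every entry of `t` is keyed by one word on each side. If several words
end up linked to the same word, that falls out of the individual
probabilities rather than from any multi-word parameter.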

>
> Any thoughts?
>
> James
>
> Quoting Philipp Koehn <[email protected]>:
>
>> Hi,
>>
>> is the idea to replace certain parts of the text with tokens such as DATE
>> and then align the rest of the sentence? I'd suggest just reformatting the
>> training data, making sure that matching tokens are added to each sentence
>> pair, and for good measure adding 1000 sentence pairs that contain only
>> DATE for both the input and output languages.
>>
>> -phi
>>
>> On Thu, Feb 26, 2009 at 1:38 AM, James Read <[email protected]> wrote:
>>> Consider the following sentence pair.
>>>
>>> I declare resumed the session of the European Parliament adjourned on Friday
>>> 17 December 1999
>>>
>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode
>>> des Europäischen Parlaments für wiederaufgenommen
>>>
>>> This sentence can be reduced to the following templates:
>>>
>>> I declare resumed the session of the European Parliament adjourned on ___
>>>
>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>> Parlaments für wiederaufgenommen
>>>
>>> Given a set of candidate tokens for such templates, could the current
>>> implementation of Giza++ figure out which template pairs align or do you
>>> think the code would need serious modifications?
>>>
>>> I hope this made my question clearer.
>>>
>>>
>>> Quoting Philipp Koehn <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> not sure what you are asking for - are you looking for phrasal
>>>> alignments, in other words frequent occurrences of the example
>>>> you mention? This is done by the phrase extraction scripts.
>>>>
>>>> -phi
>>>>
>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> thanks to everybody for responses to my query about parallelising
>>>>> Giza++. All the responses were very useful and have helped the project
>>>>> make quick progress.
>>>>>
>>>>> The greater intention is to use Giza++ to automatically find template
>>>>> translation pairs
>>>>>
>>>>> e.g.
>>>>>
>>>>> English - My name is x
>>>>> Italian - Mi chiamo x
>>>>>
>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>> current state to learning such pairs? Would it be a simple case of
>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>
>>>>> Thanks in advance for any suggestions.
>>>>>
>>>>> James
>>>>>
>>>>> --
>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland, with registration number SC005336.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> [email protected]
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
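
The preprocessing Philipp suggests in the quoted text (replace dates
with a DATE token on both sides, keep the tokens matched within each
sentence pair, and add 1000 pairs that contain only DATE) could be
sketched roughly as follows. The regular expressions and function names
are hypothetical and deliberately narrow; they cover little more than
the example sentence from this thread, and they assume the text has not
yet been tokenized.

```python
import re

# Hypothetical, illustration-only date patterns (not exhaustive).
EN_DATE = re.compile(
    r"\b(?:(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday),? )?"
    r"\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)(?: \d{4})?"
)
DE_DATE = re.compile(
    r"\b(?:(?:Montag|Dienstag|Mittwoch|Donnerstag|Freitag|Samstag|Sonntag)"
    r", dem )?"
    r"\d{1,2}\. (?:Januar|Februar|März|April|Mai|Juni|Juli|August|"
    r"September|Oktober|November|Dezember)(?: \d{4})?"
)


def mask_pair(en, de, token="DATE"):
    """Replace dates with `token`, but only if both sides match up."""
    en_masked, n_en = EN_DATE.subn(token, en)
    de_masked, n_de = DE_DATE.subn(token, de)
    if n_en != n_de:
        # Unmatched placeholders would mislead the aligner; keep original.
        return en, de
    return en_masked, de_masked


def build_corpus(pairs, extra=1000, token="DATE"):
    """Mask every pair, then append `extra` pairs consisting only of token."""
    masked = [mask_pair(en, de, token) for en, de in pairs]
    return masked + [(token, token)] * extra


en = ("I declare resumed the session of the European Parliament "
      "adjourned on Friday 17 December 1999")
de = ("Ich erkläre die am Freitag, dem 17. Dezember unterbrochene "
      "Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen")
corpus = build_corpus([(en, de)])
```

The DATE-only pairs give the word-based model strong evidence that DATE
translates as DATE, so the placeholder is aligned like any other word
and Giza++ itself needs no changes.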
