I was under the impression that Giza++ already aligns phrases (i.e. n-grams).

Quoting Chris Dyer <[email protected]>:

>> Do you think this is possible? Would Giza++ require massive
>> modifications to be able to align these kinds of tokens? My gut feeling
>> was that an n-gram with a gap in it (a template) is to all intents and
>> purposes just the same as an n-gram, and so the algorithm should
>> perform with similar accuracy.
> Giza operates in a word-by-word fashion.  So, when you see multiple
> words aligning to the same thing, that is, as far as the model is
> concerned, only accidental.  Extending alignment models to deal with
> n-grams or n-grams with gaps makes them considerably more difficult to
> estimate, and makes Giza a poor starting point for such attempts.
> There has been a variety of work in this area, though.  For a starting
> point, you can look at:
>
> Daniel Marcu and William Wong. 2002. A Phrase-Based, Joint Probability
> Model for Statistical Machine Translation. In Proc. EMNLP.
>
> Lexi Birch-Mayne. Scalable Phrase-Based, Joint Probability Model for
> Statistical Machine Translation.
>
> John DeNero, Alexandre Bouchard-Côté and Dan Klein. 2008. Sampling
> Alignment Structure under a Bayesian Translation Model. In Proc. EMNLP.
>
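To make the word-by-word point concrete, here is a minimal sketch of EM for IBM Model 1, the simplest of the models Giza++ trains. It is not Giza++ code: the toy corpus, the function name `train_model1` and the iteration count are assumptions for illustration, and the NULL word is left out. Note that every target word collects its probability mass from single source words only; nothing in the model represents an n-gram or a template.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """EM for IBM Model 1 word translation probabilities t(f | e), without a NULL word."""
    tgt_vocab = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(tgt_vocab)
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f being generated by e
        total = defaultdict(float)  # expected count of e generating anything
        for es, fs in pairs:
            for f in fs:
                # E-step: share f's unit of mass among the source words in this pair.
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per source word.
        t = {fe: c / total[fe[1]] for fe, c in count.items()}
    return t

# Hypothetical toy corpus: (source tokens, target tokens).
pairs = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
t = train_model1(pairs)
```

After training, `t[("das", "the")]` ends up larger than `t[("Haus", "the")]`, but any apparent multi-word alignment would only be a side effect of these independent per-word decisions.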
>>
>> Any thoughts?
>>
>> James
>>
>> Quoting Philipp Koehn <[email protected]>:
>>
>>> Hi,
>>>
>>> is the idea to replace certain parts of the text with tokens such as DATE
>>> and then align the rest of the sentence? I'd suggest just reformatting the
>>> training data, making sure that matching tokens are added to each sentence
>>> pair, and for good measure adding 1000 sentence pairs that contain only
>>> DATE for the input and output language.
>>>
>>> -phi
>>>
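A rough sketch of the preprocessing suggested above, assuming dates are the only thing being replaced. The regular expressions and the names `mask_pair` and `build_corpus` are hypothetical, and the patterns only cover dates shaped like those in the example sentence quoted further down; real data would need a proper date tagger. Leaving a pair unmasked when the two sides disagree on the number of dates is also just one possible choice.

```python
import re

# Hypothetical patterns: optional weekday, day of month, month name, optional year.
EN_DATE = re.compile(
    r"\b(?:(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day,? )?\d{1,2} "
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)(?: \d{4})?\b"
)
DE_DATE = re.compile(
    r"\b(?:(?:Montag|Dienstag|Mittwoch|Donnerstag|Freitag|Samstag|Sonntag), dem )?"
    r"\d{1,2}\. (?:Januar|Februar|März|April|Mai|Juni|Juli|August|"
    r"September|Oktober|November|Dezember)(?: \d{4})?\b"
)

def mask_pair(en, de):
    """Replace dates with DATE on both sides, but only if the counts match."""
    en_masked, n_en = EN_DATE.subn("DATE", en)
    de_masked, n_de = DE_DATE.subn("DATE", de)
    if n_en != n_de:
        # Tokens would not match across the pair, so keep the original text.
        return en, de
    return en_masked, de_masked

def build_corpus(pairs, extra=1000):
    """Mask every sentence pair, then append `extra` pairs that contain only DATE."""
    corpus = [mask_pair(en, de) for en, de in pairs]
    corpus.extend([("DATE", "DATE")] * extra)
    return corpus

en = ("I declare resumed the session of the European Parliament "
      "adjourned on Friday 17 December 1999")
de = ("Ich erkläre die am Freitag, dem 17. Dezember unterbrochene "
      "Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen")
corpus = build_corpus([(en, de)])
```

The resulting list can then be written out as the two parallel files that the usual training pipeline expects, one sentence per line.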
>>> On Thu, Feb 26, 2009 at 1:38 AM, James Read <[email protected]> wrote:
>>>> Consider the following sentence pair.
>>>>
>>>> I declare resumed the session of the European Parliament adjourned on
>>>> Friday 17 December 1999
>>>>
>>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene Sitzungsperiode
>>>> des Europäischen Parlaments für wiederaufgenommen
>>>>
>>>> This sentence can be reduced to the following templates:
>>>>
>>>> I declare resumed the session of the European Parliament adjourned on ___
>>>>
>>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>>> Parlaments für wiederaufgenommen
>>>>
>>>> Given a set of candidate tokens for such templates, could the current
>>>> implementation of Giza++ figure out which template pairs align, or do you
>>>> think the code would need serious modifications?
>>>>
>>>> I hope this made my question clearer.
>>>>
>>>>
>>>> Quoting Philipp Koehn <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> not sure what you are asking for. Are you looking for phrasal
>>>>> alignments, in other words frequent occurrences of the example
>>>>> you mention? This is done by the phrase extraction scripts.
>>>>>
>>>>> -phi
>>>>>
>>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> thanks to everybody for responses to my query about parallelising
>>>>>> Giza++. All the responses were very useful and have helped the project
>>>>>> make quick progress.
>>>>>>
>>>>>> The greater intention is to use Giza++ to automatically find template
>>>>>> translation pairs
>>>>>>
>>>>>> e.g.
>>>>>>
>>>>>> English - My name is x
>>>>>> Italian - Mi chiamo x
>>>>>>
>>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>>> current state to learning such pairs? Would it be a simple case of
>>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>>
>>>>>> Thanks in advance for any suggestions.
>>>>>>
>>>>>> James
>>>>>>
>>>>>> --
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>> Scotland, with registration number SC005336.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>