Ok, thanks. Can you point me to anything that describes how phrase
alignment is done as a post-process? Maybe this is an avenue I should
also be considering.

Thanks

Quoting Miles Osborne <[email protected]>:

> no:  Giza aligns *words*, not *phrases*.  phrasal alignment is done as
> a post-process.
>
> the links Chris mentioned attempt to also deal with phrases, with
> varying success.
>
> Miles
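
The post-process mentioned above is usually a phrase extraction heuristic: keep every source/target span pair that is consistent with the word alignment, meaning no word inside the pair is aligned to a word outside it. Below is a minimal sketch of that idea, not the Moses implementation. The function name `extract_phrases`, the `max_len` parameter, and the word alignment for the "My name is x" example are all assumptions made for illustration, and the sketch does not extend phrases over unaligned boundary words.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Return phrase pairs consistent with a word alignment.

    alignment: set of (i, j) pairs meaning src[i] is aligned to tgt[j].
    Simplified sketch: unaligned boundary words are not used to extend phrases.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency check: no word in the target span may be
            # aligned to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


src = "my name is x".split()
tgt = "mi chiamo x".split()
# Hypothetical word alignment: my-mi, name-chiamo, is-chiamo, x-x.
alignment = {(0, 0), (1, 1), (2, 1), (3, 2)}
phrases = extract_phrases(src, tgt, alignment)
```

With this alignment, "name is" / "chiamo" is extracted as a phrase pair, while "name" / "chiamo" alone is rejected because "chiamo" is also linked to "is".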
>
> 2009/2/27 James Read <[email protected]>:
>> I was under the impression that Giza++ already aligns phrases (i.e.
>> n-grams).
>>
>> Quoting Chris Dyer <[email protected]>:
>>
>>>> Do you think this is possible? Would Giza++ require massive
>>>> modifications to be able to align these kinds of tokens? My gut feeling
>>>> was that an n-gram with a gap in it (a template) is to all intents and
>>>> purposes just the same as an n-gram and so the algorithm should
>>>> perform with similar accuracy.
>>> Giza operates in a word-by-word fashion.  So, when you see multiple
>>> words aligning to the same thing, that is, as far as the model is
>>> concerned, only accidental.  Extending alignment models to deal with
>>> n-grams or n-grams with gaps makes them considerably more difficult to
>>> estimate, and makes Giza a poor starting point for such attempts.
>>> There has been a variety of work in this area, though.  For a starting
>>> point, you can look at:
>>>
>>> Daniel Marcu and William Wong. 2002. A phrase-based, joint probability
>>> model for statistical machine translation. In Proc. EMNLP
>>>
>>> Lexi Birch-Mayne. Scalable Phrase-Based, Joint Probability Model for
>>> Statistical Machine Translation.
>>>
>>> John DeNero, Alexandre Bouchard-Cote and Dan Klein. 2008. Sampling
>>> Alignment Structure under a Bayesian Translation Model. In Proc. EMNLP
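
To illustrate the word-by-word point, here is a toy estimator for IBM Model 1, the simplest of the IBM models that Giza++ implements. Every target word is generated by exactly one source word, so a multi-word unit only appears when several words happen to link to the same word. This is a sketch under stated assumptions: the three-sentence corpus is made up, there is no NULL word, and the name `train_model1` is illustrative rather than anything from the Giza++ code.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """Toy IBM Model 1: estimate t(f | e) with EM, without a NULL word."""
    f_vocab = {f for _, fs in corpus for f in fs}
    # Uniform starting point for every (f, e) pair.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in corpus:
            for f in fs:
                # E-step: split one count for f across the words of es,
                # in proportion to the current t(f | e).
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per word e.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Made-up corpus of (English words, German words) pairs.
corpus = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]
t = train_model1(corpus)
```

Nothing in the model scores "the house" as a unit; it only ever scores individual word pairs such as `t[("das", "the")]`, which is why phrases have to come from a later step or from a different model.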
>>>
>>>>
>>>> Any thoughts?
>>>>
>>>> James
>>>>
>>>> Quoting Philipp Koehn <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> is the idea to replace certain parts of the text with tokens such as DATE
>>>>> and then align the rest of the sentence? I'd suggest just reformatting the
>>>>> training data, making sure that matching tokens are added to each sentence
>>>>> pair, and for good measure adding 1000 sentence pairs that contain only
>>>>> DATE for both the input and output language.
>>>>>
>>>>> -phi
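
The suggestion above can be sketched as a preprocessing step run before Giza++. Everything named here is an assumption for illustration: the helper names `mask_dates` and `build_training_data` are hypothetical, and the two regular expressions only cover the date forms in the example sentence pair from this thread, so a real corpus would need a proper date tagger.

```python
import re

# Illustrative patterns only: they cover the date forms in the example
# sentence pair from this thread, not dates in general.
EN_DATE = re.compile(r"\b(?:\w+day\s+)?\d{1,2}\s+[A-Z][a-z]+\s+\d{4}\b")
DE_DATE = re.compile(r"\b(?:\w+tag,\s+dem\s+)?\d{1,2}\.\s+[A-Z][a-zä]+(?:\s+\d{4})?")


def mask_dates(sentence, pattern, token="DATE"):
    """Replace every match of pattern with a placeholder token."""
    return pattern.sub(token, sentence)


def build_training_data(pairs, extra=1000, token="DATE"):
    """Mask dates on both sides, keep only pairs whose token counts match,
    then append `extra` pairs that contain nothing but the token."""
    out = []
    for en, de in pairs:
        en_masked = mask_dates(en, EN_DATE, token)
        de_masked = mask_dates(de, DE_DATE, token)
        # "Matching tokens": both sides must carry the same number of DATEs.
        if en_masked.count(token) == de_masked.count(token):
            out.append((en_masked, de_masked))
    out.extend([(token, token)] * extra)
    return out


pairs = [(
    "I declare resumed the session of the European Parliament "
    "adjourned on Friday 17 December 1999",
    "Ich erkläre die am Freitag, dem 17. Dezember unterbrochene "
    "Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen",
)]
data = build_training_data(pairs)
```

The masked pair ends in "adjourned on DATE" on the English side and contains "die am DATE unterbrochene" on the German side, and the extra DATE-only pairs give the aligner strong evidence that DATE translates to DATE.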
>>>>>
>>>>> On Thu, Feb 26, 2009 at 1:38 AM, James Read
>>>>> <[email protected]> wrote:
>>>>>> Consider the following sentence pair.
>>>>>>
>>>>>> I declare resumed the session of the European Parliament adjourned on
>>>>>> Friday 17 December 1999
>>>>>>
>>>>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene
>>>>>> Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen
>>>>>>
>>>>>> This sentence can be reduced to the following templates:
>>>>>>
>>>>>> I declare resumed the session of the European Parliament adjourned
>>>>>> on ___
>>>>>>
>>>>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>>>>> Parlaments für wiederaufgenommen
>>>>>>
>>>>>> Given a set of candidate tokens for such templates, could the current
>>>>>> implementation of Giza++ figure out which template pairs align, or do
>>>>>> you think the code would need serious modifications?
>>>>>>
>>>>>> I hope this made my question clearer.
>>>>>>
>>>>>>
>>>>>> Quoting Philipp Koehn <[email protected]>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> not sure what you are asking for. Are you looking for phrasal
>>>>>>> alignments, in other words frequent occurrences of the example
>>>>>>> you mention? This is done by the phrase extraction scripts.
>>>>>>>
>>>>>>> -phi
>>>>>>>
>>>>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> thanks to everybody for responses to my query about parallelising
>>>>>>>> Giza++. All the responses were very useful and have helped the project
>>>>>>>> make quick progress.
>>>>>>>>
>>>>>>>> The greater intention is to use Giza++ to automatically find template
>>>>>>>> translation pairs
>>>>>>>>
>>>>>>>> e.g.
>>>>>>>>
>>>>>>>> English - My name is x
>>>>>>>> Italian - Mi chiamo x
>>>>>>>>
>>>>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>>>>> current state to learning such pairs? Would it be a simple case of
>>>>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>>>>
>>>>>>>> Thanks in advance for any suggestions.
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>>>>>>>> --
>>>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>>>> Scotland, with registration number SC005336.
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Moses-support mailing list
>>>>>>>> [email protected]
>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support