Re: [Moses-support] Giza++ input tokens (templates)

Philipp Koehn Thu, 26 Feb 2009 12:21:46 -0800

Hi,

it seems that what you want to learn lies outside of GIZA++: you want to
learn patterns based on word alignments. Once you have extracted the patterns
from the corpus and replaced them in the aligned sentences, you could
re-run the word alignment with GIZA++.


-phi
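
The replacement step described above can be sketched as follows. This is
only an illustration, not part of GIZA++ or Moses: the function names, the
"___" placeholder, and the hand-written alignment links in the usage note
are assumptions. Alignments are taken to be "i-j" pairs of zero-based
source and target word indices.

```python
def parse_alignment(line):
    """Parse 'i-j' pairs (source index, target index) from one alignment line."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def collapse_span(src, tgt, alignment, src_start, src_end, placeholder="___"):
    """Replace src[src_start:src_end] and the target words aligned to it
    with a single placeholder token on each side.

    Returns (new_src, new_tgt), or None when the span has no aligned target
    words or when the covered target span also aligns to source words
    outside the requested source span.
    """
    tgt_idx = [j for i, j in alignment if src_start <= i < src_end]
    if not tgt_idx:
        return None
    t_start, t_end = min(tgt_idx), max(tgt_idx) + 1
    # Consistency check: every link into the target span must come from
    # inside the source span.
    for i, j in alignment:
        if t_start <= j < t_end and not (src_start <= i < src_end):
            return None
    new_src = src[:src_start] + [placeholder] + src[src_end:]
    new_tgt = tgt[:t_start] + [placeholder] + tgt[t_end:]
    return new_src, new_tgt
```

For the Europarl sentence pair quoted further down, tokenized on
whitespace with the comma split off, calling collapse_span on source
positions 11 to 15 with the assumed links "11-4 12-7 13-8" yields the two
templates given in that message. The rewritten corpus can then be passed
to GIZA++ again.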

On Thu, Feb 26, 2009 at 7:38 PM, James Read <[email protected]> wrote:
> While this would work for DATE, the problem is much more widespread. There
> are many such natural-language templates in the Europarl corpus,
> which I don't have the time to identify by hand. My aim is to learn them all
> in an unsupervised fashion. My hope was that the corpus could be parsed into
> candidate templates in some way (brute-force removal of n words from
> strings), and that these could be passed to Giza++ to get back some kind of
> reasonable alignment probability.
>
> Do you think this is possible? Would Giza++ require massive modifications to
> be able to align these kinds of tokens? My gut feeling was that an n-gram with
> a gap in it (a template) is to all intents and purposes just the same as an
> n-gram, and so the algorithm should perform with similar accuracy.
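
The brute-force candidate generation mentioned above could look roughly
like the sketch below. It is an assumption about what "removal of n words"
means here (one contiguous run of words replaced by a gap); the names
candidate_templates, count_templates, max_gap and min_count are
hypothetical and not part of Giza++.

```python
from collections import Counter


def candidate_templates(tokens, max_gap=4, gap="___"):
    """Yield every template obtained by replacing one contiguous run of
    1..max_gap tokens with a single gap marker. A template consisting of
    nothing but the gap is never produced."""
    n = len(tokens)
    for start in range(n):
        for length in range(1, max_gap + 1):
            end = start + length
            if end > n or length == n:
                break
            yield tuple(tokens[:start] + [gap] + tokens[end:])


def count_templates(sentences, max_gap=4, min_count=2):
    """Count in how many sentences each candidate template occurs and
    keep those seen in at least min_count sentences."""
    counts = Counter()
    for sentence in sentences:
        # set() so a template is counted once per sentence.
        counts.update(set(candidate_templates(sentence.split(), max_gap)))
    return {t: c for t, c in counts.items() if c >= min_count}
```

The number of candidates grows with sentence length times max_gap, so on a
corpus the size of Europarl a frequency threshold such as min_count is
probably needed before anything is handed to an aligner.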
>
> Any thoughts?
>
> James
>
> Quoting Philipp Koehn <[email protected]>:
>
>> Hi,
>>
>> is the idea to replace certain parts of the text with tokens such as DATE
>> and then align the rest of the sentence? I'd suggest just reformatting the
>> training data, making sure that matching tokens are added to each sentence
>> pair, and for good measure adding 1000 sentence pairs that contain only
>> DATE for both the input and the output language.
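
A minimal sketch of this reformatting for an English-German corpus is given
below. The two regular expressions are deliberately simple illustrations
and will miss many date formats; the names tag_pair and anchor_pairs are
hypothetical.

```python
import re

# Illustrative patterns only: "17 December 1999" and "17. Dezember [1999]".
EN_DATE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December)(?: \d{4})?\b"
)
DE_DATE = re.compile(
    r"\b\d{1,2}\. (?:Januar|Februar|März|April|Mai|Juni|Juli|August"
    r"|September|Oktober|November|Dezember)(?: \d{4})?\b"
)


def tag_pair(en, de):
    """Replace dates with the token DATE on both sides of a sentence pair.

    The substitution is kept only when both sides receive the same number
    of DATE tokens, so that the tokens match across the pair; otherwise
    the pair is returned unchanged.
    """
    en_tagged, n_en = EN_DATE.subn("DATE", en)
    de_tagged, n_de = DE_DATE.subn("DATE", de)
    if n_en == n_de:
        return en_tagged, de_tagged
    return en, de


def anchor_pairs(n=1000):
    """Extra training pairs that contain only DATE on both sides."""
    return [("DATE", "DATE")] * n
```

The tagged pairs plus the output of anchor_pairs() would then be written
out as the new training corpus before running the alignment.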
>>
>> -phi
>>
>> On Thu, Feb 26, 2009 at 1:38 AM, James Read <[email protected]> wrote:
>>>
>>> Consider the following sentence pair.
>>>
>>> I declare resumed the session of the European Parliament adjourned on
>>> Friday 17 December 1999
>>>
>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene
>>> Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen
>>>
>>> This sentence can be reduced to the following templates:
>>>
>>> I declare resumed the session of the European Parliament adjourned on ___
>>>
>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>> Parlaments für wiederaufgenommen
>>>
>>> Given a set of candidate tokens for such templates, could the current
>>> implementation of Giza++ figure out which template pairs align, or do you
>>> think the code would need serious modifications?
>>>
>>> I hope this made my question clearer.
>>>
>>>
>>> Quoting Philipp Koehn <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> not sure what you are asking for. Are you looking for phrasal
>>>> alignments, in other words frequent occurrences of the example
>>>> you mention? This is done by the phrase extraction scripts.
>>>>
>>>> -phi
>>>>
>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> thanks to everybody for responses to my query about parallelising
>>>>> Giza++. All the responses were very useful and have helped the project
>>>>> make quick progress.
>>>>>
>>>>> The greater intention is to use Giza++ to automatically find template
>>>>> translation pairs
>>>>>
>>>>> e.g.
>>>>>
>>>>> English - My name is x
>>>>> Italian - Mi chiamo x
>>>>>
>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>> current state to learning such pairs? Would it be a simple case of
>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>
>>>>> Thanks in advance for any suggestions.
>>>>>
>>>>> James
>>>>>
>>>>> --
>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland, with registration number SC005336.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> [email protected]
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
