Hi,

it seems that what you want to learn lies outside of GIZA++: you want to learn patterns based on word alignments. Once you have extracted the patterns from the corpus and replaced them in the aligned sentences, you could re-run the word alignment with GIZA++.
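A minimal sketch of the replacement step, assuming Python and a simple date pattern. The names `DATE_RE` and `mask_pair` are hypothetical helpers for illustration and are not part of GIZA++ or Moses. A sentence pair is only masked when both sides contain the same number of matches, so that matching tokens end up on each side before the alignment is re-run.

```python
import re

# Hypothetical pattern (an assumption, not from GIZA++ or Moses):
# a day number, an optional dot, a month name, and an optional year.
MONTHS = (
    "January|February|March|April|May|June|July|August|September|"
    "October|November|December|"
    "Januar|Februar|März|Mai|Juni|Juli|Oktober|Dezember"
)
DATE_RE = re.compile(r"\b\d{1,2}\.?\s+(?:" + MONTHS + r")\b(?:\s+\d{4}\b)?")


def mask_pair(src, tgt, token="DATE"):
    """Replace date expressions with `token` on both sides of a sentence pair.

    The pair is returned unchanged if the two sides do not contain the
    same number of matches, so placeholders always occur in matching numbers.
    """
    masked_src, n_src = DATE_RE.subn(token, src)
    masked_tgt, n_tgt = DATE_RE.subn(token, tgt)
    if n_src != n_tgt:
        return src, tgt
    return masked_src, masked_tgt


if __name__ == "__main__":
    en = ("I declare resumed the session of the European Parliament "
          "adjourned on Friday 17 December 1999")
    de = ("Ich erkläre die am Freitag, dem 17. Dezember unterbrochene "
          "Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen")
    for sentence in mask_pair(en, de):
        print(sentence)
```

The masked files would then be passed through the usual training pipeline in place of the original corpus; a learned set of patterns would take the place of the hand-written regular expression.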
-phi

On Thu, Feb 26, 2009 at 7:38 PM, James Read <[email protected]> wrote:
> While this would work for DATE, the problem is much wider in scale. There
> are many natural language templates of this kind in the Europarl corpus
> which I don't have the time to identify by hand. My aim is to learn them all
> in an unsupervised fashion. My hope was that the corpus could be parsed into
> candidate templates in some way (brute-force removal of n words from
> strings), and that these could be passed to Giza++ to get back some kind of
> reasonable probability of alignment.
>
> Do you think this is possible? Would Giza++ require massive modifications to
> be able to align these kinds of tokens? My gut feeling was that an n-gram with
> a gap in it (a template) is to all intents and purposes just the same as an
> n-gram, and so the algorithm should perform with similar accuracy.
>
> Any thoughts?
>
> James
>
> Quoting Philipp Koehn <[email protected]>:
>
>> Hi,
>>
>> is the idea to replace certain parts of the text with tokens such as DATE
>> and then align the rest of the sentence? I'd suggest just reformatting the
>> training data, making sure that matching tokens are added to each sentence
>> pair, and for good measure adding 1000 sentence pairs that contain only
>> DATE for the input and output language.
>>
>> -phi
>>
>> On Thu, Feb 26, 2009 at 1:38 AM, James Read <[email protected]> wrote:
>>>
>>> Consider the following sentence pair.
>>>
>>> I declare resumed the session of the European Parliament adjourned on
>>> Friday 17 December 1999
>>>
>>> Ich erkläre die am Freitag, dem 17. Dezember unterbrochene
>>> Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen
>>>
>>> This sentence pair can be reduced to the following templates:
>>>
>>> I declare resumed the session of the European Parliament adjourned on ___
>>>
>>> Ich erkläre die am ___ unterbrochene Sitzungsperiode des Europäischen
>>> Parlaments für wiederaufgenommen
>>>
>>> Given a set of candidate tokens for such templates, could the current
>>> implementation of Giza++ figure out which template pairs align, or do you
>>> think the code would need serious modifications?
>>>
>>> I hope this makes my question clearer.
>>>
>>> Quoting Philipp Koehn <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> not sure what you are asking for. Are you looking for phrasal
>>>> alignments, in other words frequent occurrences of the example
>>>> you mention? This is done by the phrase extraction scripts.
>>>>
>>>> -phi
>>>>
>>>> On Wed, Feb 25, 2009 at 1:04 PM, James Read <[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> thanks to everybody for the responses to my query about parallelising
>>>>> Giza++. All the responses were very useful and have helped the project
>>>>> make quick progress.
>>>>>
>>>>> The greater intention is to use Giza++ to automatically find template
>>>>> translation pairs
>>>>>
>>>>> e.g.
>>>>>
>>>>> English - My name is x
>>>>> Italian - Mi chiamo x
>>>>>
>>>>> Does anybody have any ideas about how adaptable Giza++ is in its
>>>>> current state to learning such pairs? Would it be a simple case of
>>>>> presenting Giza++ with candidate tokens to align? Or would
>>>>> modifications to the EM algorithms be necessary to accomplish this?
>>>>>
>>>>> Thanks in advance for any suggestions.
>>>>>
>>>>> James
>>>>>
>>>>> --
>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland, with registration number SC005336.
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> [email protected]
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
