Re: [Apertium-stuff] Old fashioned SMT IBM model 1 outperforms Apertium

Per Tunedal Fri, 30 Aug 2013 01:41:08 -0700

Hi,

On Thu, Aug 29, 2013, at 11:20, Francis Tyers wrote:
> El dj 29 de 08 de 2013 a les 10:13 +0200, en/na Per Tunedal va escriure:
> > Hi,
> > the design of Apertium has some resemblance with the outdated
> > word-to-word statistical translations models, especially the simplest:
> > IBM model 1:
> > 1. The translation is made word by word.
> > 2. The most probable translation of a word is chosen (developers are
> > advised to have only one translation in the bidix - the most common).
> > 3. The translation is supposed to work best for closely related
> > languages.
> > 
> > Point 2 makes Apertium quite similar to IBM model 1 without the language
> > model: then only the most probable word is chosen. Unfortunately, this
> > often leads to terrible translations.
> 
> Except:
> 
> * You can use the lexical selection module, which can give equivalent
> results to using a target-language model.


Sure. It's on the to-do list.

> * In IBM model 1 there is no reordering.

True. But there isn't much need for reordering (if any) when translating
between Swedish and Danish. That's why I've chosen to challenge Apertium
with the simple IBM model 1. My task is now to beat that simple
statistical translator, with your help I hope.
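
For readers unfamiliar with IBM model 1, here is a minimal sketch of how
its word translation table can be estimated with EM. It is an
illustration only, not the training program mentioned in this thread:
the names train_ibm1, best_translation and toy_pairs are made up, and
the three Swedish-Danish sentence pairs are invented toy data.

```python
from collections import defaultdict

NULL = "<null>"  # empty source word that any target word may align to


def train_ibm1(pairs, iterations=10):
    """Estimate t(target | source) from (source_tokens, target_tokens) pairs."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (target, source) link
        total = defaultdict(float)  # expected count of each source word
        # E-step: share each target word among the source words of its sentence
        for src, tgt in pairs:
            src = [NULL] + src
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts into probabilities
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_translation(t, src_word):
    """Return the most probable target word for src_word (no language model)."""
    candidates = {f: p for (f, e), p in t.items() if e == src_word}
    return max(candidates, key=candidates.get)


# Invented toy corpus, Swedish -> Danish
toy_pairs = [
    (["ett", "hus"], ["et", "hus"]),
    (["ett", "bord"], ["et", "bord"]),
    (["rött", "bord"], ["rødt", "bord"]),
]

if __name__ == "__main__":
    table = train_ibm1(toy_pairs)
    for word in ["ett", "hus", "bord", "rött"]:
        print(word, "->", best_translation(table, word))
```

Picking best_translation for every word is the "most probable word only"
behaviour discussed above; there is no reordering and no fertility.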

> 
> > Thus, adding the language model to ensure "fluent" output should
> > outperform Apertium. And it does. On closely related languages.
> > 
> > I've written my own IBM model 1 training program and decoder
> > (translator). I trained on the Block World Corpus and built 3-gram
> > language models with IRSTLM (available at
> > http://www.tunedal.nu/download/block_world_corpus/). Finally I
> > translated the evaluation files (available at the above site) from da to
> > sv (and the other way around) and from sv to en (and the other way
> > around).
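
To illustrate how a language model can change a word-by-word
translation, here is a small sketch of monotone decoding. It is not the
decoder described above and it does not use IRSTLM: it assumes a tiny
add-one smoothed bigram model instead of a 3-gram model, and the names
train_bigram_lm, decode, toy_options and toy_lm, as well as all the
probabilities, are invented for the example.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Build an add-one smoothed bigram model; returns logp(prev, cur)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    size = len(vocab)

    def logp(prev, cur):
        return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + size))

    return logp


def decode(src_tokens, options, lm_logp):
    """Word-by-word Viterbi search without reordering.

    options maps a source word to {target word: t(target | source)}.
    """
    best = {"<s>": (0.0, [])}  # last target word -> (log score, path so far)
    for s in src_tokens:
        cands = options.get(s, {s: 1.0})  # unknown words are copied through
        new_best = {}
        for w, p in cands.items():
            score, path = max(
                ((sc + lm_logp(prev, w) + math.log(p), pth)
                 for prev, (sc, pth) in best.items()),
                key=lambda item: item[0],
            )
            new_best[w] = (score, path + [w])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Invented toy data, Swedish -> English: "får" is ambiguous
toy_options = {
    "ett": {"a": 1.0},
    "hon": {"she": 1.0},
    "får": {"gets": 0.6, "sheep": 0.4},
}
toy_lm = train_bigram_lm([["a", "sheep"], ["she", "gets"], ["a", "house"]])

if __name__ == "__main__":
    print(decode(["ett", "får"], toy_options, toy_lm))
    print(decode(["hon", "får"], toy_options, toy_lm))
```

In this toy setup the translation table alone would always choose
"gets" for "får"; the bigram scores are what let "sheep" win after "a".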
> > 
> > Results:
> > 1. The translation between Swedish and English is mostly terrible
> > (largely because IBM model 1 has no fertility model, i.e. one word
> > produces only one translated word).
> > 2. The translation between Swedish and Danish is in most cases
> > acceptable. Only a few sentences are terrible. On the whole it looks
> > much better than the translations from Apertium - in spite of my efforts
> > since last year.
> 
> The English data for the corpus is kind of weird (borderline
> ungrammatical) in some places. 

Feel free to improve the English data. All improvements are welcome!

> 
> Your efforts since last year have mostly made the pair worse not better.
> This is probably unintentional, but was my impression last time I looked
> at it.
> 

True. Most of the problems arise because I've postponed the tagger
training, following your advice. The tagger performed badly from the
start and hasn't had a chance since I changed the terminology in the
dictionaries to comply with most languages, including Norwegian.

The other problem is that I've introduced quite a lot of synonyms. I hope
that implementing your lexical selection module will take care of them.

Finally, I have to trim the dictionaries. I might need some help with
the script.


> Fran
> 

Yours,
Per Tunedal
