Hi,

On Thu, Aug 29, 2013, at 11:20, Francis Tyers wrote:
> On Thu, 29 Aug 2013 at 10:13 +0200, Per Tunedal wrote:
> > Hi,
> > the design of Apertium has some resemblance with the outdated
> > word-to-word statistical translation models, especially the simplest:
> > IBM model 1:
> > 1. The translation is made word by word.
> > 2. The most probable translation of a word is chosen (developers are
> > advised to have only one translation in the bidix - the most common).
> > 3. The translation is supposed to work best for closely related
> > languages.
> >
> > Point 2 makes Apertium quite similar to IBM model 1 without the language
> > model: then only the most probable word is chosen. Unfortunately, this
> > often leads to terrible translations.
>
> Except:
>
> * You can use the lexical selection module, which can give equivalent
> results to using a target-language model.

Sure. It's on the to-do list.

> * In IBM model 1 there is no reordering.

True. But there isn't much need for reordering (if any) when translating
between Swedish and Danish. That's why I've chosen to challenge Apertium
with the simple IBM model 1. My task is now to beat that simple
statistical translator, with your help I hope.

> >
> > Thus, adding the language model to ensure "fluent" output should
> > outperform Apertium. And it does. On closely related languages.
> >
> > I've written my own IBM model 1 training program and decoder
> > (translator). I trained on the Block World Corpus and built 3-gram
> > language models with IRSTLM (available at
> > http://www.tunedal.nu/download/block_world_corpus/). Finally I
> > translated the evaluation files (available at the above site) from da to
> > sv (and the other way around) and from sv to en (and the other way
> > around).
> >
> > Results:
> > 1. The translation between Swedish and English is mostly terrible (to a
> > large extent due to that IBM 1 doesn't use any fertility i.e. one word
> > only produces one translated word).
> > 2. The translation between Swedish and Danish is in most cases
> > acceptable. Only a few sentences are terrible. On the whole it looks
> > much better than the translations from Apertium - in spite of my efforts
> > since last year.
>
> The English data for the corpus is kind of weird (borderline
> ungrammatical) in some places.

Feel free to improve the English data. All improvements are welcome!

>
> Your efforts since last year have mostly made the pair worse not better.
> This is probably unintentional, but was my impression last time I looked
> at it.
>

True. Most of the problems arise because I've postponed the tagger
training, following your advice. The tagger performed badly from the
start and hasn't had a chance since I changed the terminology in the
dictionaries to comply with most languages, including Norwegian. The
other problem is that I've introduced quite a lot of synonyms.
I hope that implementing your lexical selection module will take care of
them. Finally, I have to trim the dictionaries. I might need some help
with the script.

> Fran
>

Yours,
Per Tunedal

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
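
For reference, the word-by-word model discussed in this thread can be
sketched in a few lines of Python. This is an illustrative sketch only,
not the training program or decoder mentioned above: the function names
(train_ibm1, translate), the three-sentence Danish-Swedish toy corpus and
the choice to estimate P(target word | source word) directly are all
assumptions made for the example. It also leaves out the NULL word and
the language model, so it corresponds to the "most probable word only"
case.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus (Danish -> Swedish), for illustration only.
CORPUS = [
    (["jeg", "spiser"], ["jag", "äter"]),
    (["jeg", "sover"], ["jag", "sover"]),
    (["du", "sover"], ["du", "sover"]),
]


def train_ibm1(pairs, iterations=10):
    """Estimate P(target word | source word) with EM, IBM model 1 style."""
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    uniform = 1.0 / len(tgt_vocab)
    # prob[(t, s)] = P(t | s); every pair starts from a uniform value.
    prob = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (t, s) alignments
        total = defaultdict(float)  # expected count of s being aligned
        for src, tgt in pairs:
            for t in tgt:
                # E-step: share each target word among the source words
                # of the same sentence, in proportion to P(t | s).
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    frac = prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalise the expected counts per source word.
        for (t, s), c in count.items():
            prob[(t, s)] = c / total[s]
    return dict(prob)


def translate(model, words):
    """Word-by-word decoding: pick the most probable target word.

    No reordering, no fertility, no language model. Unknown words are
    copied through unchanged.
    """
    out = []
    for s in words:
        candidates = {t: p for (t, src_word), p in model.items() if src_word == s}
        out.append(max(candidates, key=candidates.get) if candidates else s)
    return out
```

On this toy corpus, translate(train_ibm1(CORPUS), ["jeg", "spiser"])
gives ["jag", "äter"]. A real decoder for the setup described in the
thread would additionally score candidate words with the 3-gram language
model rather than taking the single most probable word.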
