Hi,

I'm pleased to hear that Jonas Fromseier Mortensen plans to start working on Norwegian-Danish (no-da), including both bokmål (nb) and nynorsk (nn). That would make it much easier for me to realize my original plan of setting up the pair Norwegian-Swedish (no-sv), likewise including both bokmål (nb) and nynorsk (nn).

What has kept me from starting that work so far is that I was pushed into first fixing "some minor issues" with the pair Swedish-Danish (sv-da). OK, I'll give it a week, I thought, and I have now spent a year! My goal was to fix the most blatant errors and to extend the dictionaries with more words used in ordinary life, rather than in the EU Parliament. I also wanted to release the other translation direction, Danish to Swedish (da-sv).

Status as of today:

1. I've fixed some errors, but many are yet to be found and tackled. Some errors might be fixed by retraining the tagger, writing some clever transfer rules and using the new disambiguator: that remains for me to try.

2. I've added quite a few new words, mainly by:
   a) adding entries from the pair Icelandic-Swedish (is-sv)
   b) panning for gold in various sources, using a wish list of Danish and Swedish words.

   - I hoped that many of the words would "meet in the middle", i.e. would already be present in both monodixes, letting me just add the translation to the bidix. Unfortunately, this only happened for about a third of the added words. Consequently, I have to add some words manually to the monodixes.
   - By now, I've added most of the wanted nouns and verbs that I found. I have simply skipped all words I couldn't translate effortlessly.
   - Many common adjectives and adverbs remain to be added.

Further, I've added quite a few abbreviations and some common false friends I know of. I've also started some work on pronouns; many are still missing.

Working with the bidix has revealed that many of the words in the Danish dictionary (much larger than the Swedish dictionary) simply do not exist in Danish. All the same, they are neatly entered in the monodix with valid paradigms. Apparently, one or more of the semi-automatic tools has run amok. This is a minor problem for me, as they will all go away when I trim the dictionaries, but it might be a nuisance for Jonas while working on the new pair Norwegian-Danish (no-da).

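The "meet in the middle" check described above can be sketched in a few lines of Python. This is only an illustration under assumptions of mine, not the procedure actually used: the function names lemmas_from_monodix and split_wish_list and the toy dictionaries are invented, each monodix entry is assumed to be an <e> element carrying its lemma in the lm attribute, and parts of speech and paradigms are ignored.

```python
import xml.etree.ElementTree as ET


def lemmas_from_monodix(xml_text):
    """Collect the lm attribute of every <e> entry in a monodix-style XML string."""
    root = ET.fromstring(xml_text)
    return {e.get("lm") for e in root.iter("e") if e.get("lm")}


def split_wish_list(pairs, da_lemmas, sv_lemmas):
    """Split (Danish, Swedish) pairs into those covered by both monodixes and the rest."""
    ready, missing = [], []
    for da, sv in pairs:
        if da in da_lemmas and sv in sv_lemmas:
            ready.append((da, sv))    # only a bidix entry is needed
        else:
            missing.append((da, sv))  # at least one monodix entry must be added first
    return ready, missing


# Toy data, invented for illustration only.
DA_DIX = '<dictionary><section id="main" type="standard"><e lm="hus"/><e lm="bil"/></section></dictionary>'
SV_DIX = '<dictionary><section id="main" type="standard"><e lm="hus"/><e lm="cykel"/></section></dictionary>'

wish_list = [("hus", "hus"), ("bil", "bil"), ("cykel", "cykel")]
ready, missing = split_wish_list(
    wish_list, lemmas_from_monodix(DA_DIX), lemmas_from_monodix(SV_DIX)
)
print(ready)
print(missing)
```

With this toy data only one of the three pairs is found in both dictionaries; the other two would need a manual monodix entry before the bidix translation can be added.
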
Another problem is that my knowledge of Danish is very limited. I have tried to make informed guesses with the help of dictionaries and an introductory grammar. All the same, some of my entries, especially in the Danish monodix, might be erroneous. It might be a good idea to take a glance at them (they are marked with my initials, PT), for instance by expanding the monodix and looking for odd entries, or by translating some test texts and spotting errors.

The translation is still very poor, and unfortunately I believe this is very hard to fix. I've identified the tagger and word disambiguation as the critical steps. I've come to the conclusion that it's silly to let the tagger choose one and only one translation. A better disambiguation would be most helpful. Maybe it would be possible to translate all possible matches, disregarding the part of speech, and later choose the translation that makes the most sense or is the most fluent in the target language? Or to use a disambiguator instead of the tagger? I will gladly discuss this in a separate thread.

Right now I'm quite busy with other projects, so I cannot do much work on Apertium. On the other hand, I'm always interested in having a discussion.

Yours,
Per Tunedal

_______________________________________________
Apertium-stuff mailing list
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
