On Wed, 19 Sep 2012 at 17:56 +0200, Kevin Brubeck Unhammer wrote:
> I was about to do the release of apertium-sme-nob this spring, but then
> FreeRBMT12 and everything happened … anyway, the North Sámi to Norwegian
> Bokmål language pair is now officially released, with version 0.5.0
> uploaded to SourceForge[1]. Apart from the regular prerequisites, it
> requires HFST[2] and vislcg3[3] installed (see the README).
>
> The language pair is mostly based on resources made by the Giellatekno
> group at the University of Tromsø (this includes the sme
> analyser+disambiguation and most of the nob-sme dictionary), and quite a
> lot of people from that group have worked on it directly or indirectly,
> including Berit N. Eskonsipo, Francis Tyers, Lene Antonsen, Linda
> Wiechetek, Ritva Nystad, Sjur N. Moshagen and Trond Trosterud.
>
> The bilingual dictionary has 21243 entries (plus 43890 proper nouns);
> the sme monolingual _should_ be auto-trimmed to those entries (it seemed
> testvoc clean, but that's currently a bit hard to test with HFST
> analysers). The nob dictionary is copied over from nn-nb and is
> generation-only, so currently untrimmed with 98427 entries.
>
> The Giellatekno disambiguation Constraint Grammar has around 3700 rules;
> there's also a lexical selection CG especially created for this language
> pair with 102 rules.
>
> Transfer is four-stage:
> 1. 63 chunking rules
> 2. 26 coordination/postposition-cleanup rules
> 3. 39 word order + pro-drop rules
> 4. 29 trivial cleanup rules
>
> Coverage and ambiguity rate (number of analyses per token) over some
> corpora, with and without derivational morphology turned on to see its
> effect on coverage:
> | Corpus | tokens  | coverage           | ambig.rate           |
> |--------+---------+--------------------+----------------------|
> | laws   |   51706 | 94.68%             | 2.65                 |
> | wiki   |   19942 | 77.52%             | 2.36                 |
> | news   | 1020250 | 94.72%             | 2.59                 |
> |--------+---------+--------------------+----------------------|
> | Corpus | tokens  | coverage w/o deriv | ambig.rate w/o deriv |
> |--------+---------+--------------------+----------------------|
> | laws   |   51706 | 86.02%             | 2.32                 |
> | wiki   |   19942 | 74.56%             | 2.19                 |
> | news   | 1020250 | 90.96%             | 2.34                 |
>
> WER results are around 50%, i.e. not really useful for post-editing, but
> this is a long-distance language pair meant for gisting / "MT for
> understanding" (it is also from minority to majority language, a similar
> situation to eu→es I guess). There is a paper[4] and talk[5] presented
> at FreeRBMT12 describing the language pair in more detail and showing
> some fairly good evaluation results with respect to gisting. Also, I
> find it very useful for gisting :-)
>
> Thanks to everyone involved in working on this project, and to
> Giellatekno for paying me to do what I like best :-) Regarding future
> work, some main goals are increasing coverage both of bidix and of
> chunking rules, as well as improving disambiguation. There's always
> stuff to do …
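To get a feel for what the coverage figures quoted above mean in absolute terms, here is a minimal Python sketch that converts them into approximate unknown-token counts and into the percentage-point gain from derivational morphology. The numbers are copied from the table; the helper names are made up for illustration, and the sketch assumes "coverage" means the share of tokens that receive at least one analysis, which the announcement does not spell out.

```python
# Figures copied from the coverage table in the quoted announcement.
# name: (tokens, coverage % with derivation, coverage % without derivation)
CORPORA = {
    "laws": (51706, 94.68, 86.02),
    "wiki": (19942, 77.52, 74.56),
    "news": (1020250, 94.72, 90.96),
}


def unknown_tokens(tokens, coverage_pct):
    """Approximate number of tokens left without any analysis.

    Assumes coverage is the percentage of tokens with at least one analysis.
    """
    return round(tokens * (100.0 - coverage_pct) / 100.0)


def coverage_gain(with_deriv_pct, without_deriv_pct):
    """Percentage points of coverage gained by turning derivation on."""
    return round(with_deriv_pct - without_deriv_pct, 2)


def summarise(corpora=CORPORA):
    """Return per-corpus unknown-token estimates and coverage gain."""
    rows = {}
    for name, (tokens, with_d, without_d) in corpora.items():
        rows[name] = {
            "unknown_with": unknown_tokens(tokens, with_d),
            "unknown_without": unknown_tokens(tokens, without_d),
            "gain_pp": coverage_gain(with_d, without_d),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarise().items():
        print(f"{name}: +{row['gain_pp']:.2f} pp coverage, "
              f"{row['unknown_without']} -> {row['unknown_with']} unknown tokens")
```

The unknown-token counts are estimates only, since the published percentages are already rounded to two decimals.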
Fantastic work Unhammer! And all involved! :D It's great to finally get
this released! Regarding the "not-useful-for-postedition", this could be
one of the pairs to include in the "acat" tool.

For an example of the output, take this article from Ávvir today:

http://www.avvir.no/vivvo_general/2343.html

What on earth is it about? Well, Norwegian speakers can pass it through
sme-nob and get what looks to me like a fairly serviceable gisting
translation. And if you don't speak Norwegian, you can try the
nursery/apertium-no-en output below (not bad either).

http://pastebin.com/HHiMjCez

I think this is our "heaviest" released language pair yet -- in terms of
prerequisites (don't try to compile unless you have more than 2 GB of RAM)
and number of rules -- and it really stretches the boundaries of what can
be done with Apertium.

This shows that the prospects of using Apertium to make gisting
translators for other Uralic languages are probably pretty good too.

Fran

_______________________________________________
Apertium-stuff mailing list
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
