I was going to release apertium-sme-nob this spring, but then FreeRBMT12 and everything happened … anyway, the North Sámi to Norwegian Bokmål language pair is now officially released, with version 0.5.0 uploaded to SourceForge[1]. Apart from the regular prerequisites, it requires HFST[2] and vislcg3[3] to be installed (see the README).
The language pair is mostly based on resources made by the Giellatekno group at the University of Tromsø (this includes the sme analyser+disambiguation and most of the nob-sme dictionary), and quite a lot of people from that group have worked on it directly or indirectly, including Berit N. Eskonsipo, Francis Tyers, Lene Antonsen, Linda Wiechetek, Ritva Nystad, Sjur N. Moshagen and Trond Trosterud.

The bilingual dictionary has 21243 entries (plus 43890 proper nouns); the sme monolingual dictionary _should_ be auto-trimmed to those entries (it seemed testvoc clean, but that's currently a bit hard to test with HFST analysers). The nob dictionary is copied over from nn-nb and is generation-only, so it is currently untrimmed, with 98427 entries.

The Giellatekno disambiguation Constraint Grammar has around 3700 rules; there's also a lexical selection CG with 102 rules, created especially for this language pair.

Transfer is four-stage:

1. 63 chunking rules
2. 26 coordination/postposition-cleanup rules
3. 39 word order + pro-drop rules
4. 29 trivial cleanup rules

Coverage and ambiguity rate (number of analyses per token) over some corpora, with and without derivational morphology turned on to see its effect on coverage:

| Corpus | tokens  | coverage           | ambig.rate           |
|--------+---------+--------------------+----------------------|
| laws   |   51706 | 94.68%             | 2.65                 |
| wiki   |   19942 | 77.52%             | 2.36                 |
| news   | 1020250 | 94.72%             | 2.59                 |
|--------+---------+--------------------+----------------------|
| Corpus | tokens  | coverage w/o deriv | ambig.rate w/o deriv |
|--------+---------+--------------------+----------------------|
| laws   |   51706 | 86.02%             | 2.32                 |
| wiki   |   19942 | 74.56%             | 2.19                 |
| news   | 1020250 | 90.96%             | 2.34                 |

WER is around 50%, i.e. not really useful for post-editing, but this is a long-distance language pair meant for gisting / "MT for understanding" (it is also from minority to majority language, a similar situation to eu→es I guess).
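As a rough illustration of how the coverage and ambiguity-rate figures above are defined (only a sketch, not the script that produced the numbers): assuming analyser output in the Apertium stream format, where each token looks like ^surface/analysis1/analysis2$ and an unknown word gets a single analysis starting with *, both values can be counted in one pass. The function name, the made-up sample string and the choice to average ambiguity over known tokens only are assumptions for illustration; escaped slashes inside a token are ignored for simplicity.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage_and_ambiguity(analysed_text):
    """Return (tokens, coverage, ambiguity_rate) for analyser output.

    coverage       = known tokens / all tokens
    ambiguity_rate = analyses per known token (assumption: unknown
                     tokens are left out of the average)
    """
    tokens = 0
    known = 0
    analyses_total = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        # First field is the surface form, the rest are analyses.
        surface, *analyses = match.group(1).split("/")
        tokens += 1
        # Unknown words carry a single pseudo-analysis starting with '*'.
        if analyses and not analyses[0].startswith("*"):
            known += 1
            analyses_total += len(analyses)
    coverage = known / tokens if tokens else 0.0
    ambiguity = analyses_total / known if known else 0.0
    return tokens, coverage, ambiguity


# Made-up sample: one ambiguous token, one unambiguous, one unknown.
sample = "^a/a<n>/a<v>$ ^b/b<n>$ ^c/*c$"
tokens, coverage, ambiguity = coverage_and_ambiguity(sample)
print(f"tokens={tokens} coverage={coverage:.2%} ambig.rate={ambiguity:.2f}")
# prints: tokens=3 coverage=66.67% ambig.rate=1.50
```

Counting this way, turning derivational morphology on can raise both numbers at once, since it adds analyses for previously unknown tokens as well as extra readings for known ones.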
There is a paper[4] and talk[5] presented at FreeRBMT12 describing the language pair in more detail and showing some fairly good evaluation results with respect to gisting. Also, I find it very useful for gisting :-)

Thanks to everyone involved in working on this project, and to Giellatekno for paying me to do what I like best :-)

Regarding future work, some main goals are increasing coverage both of bidix and of chunking rules, as well as improving disambiguation. There's always stuff to do …

[1] https://sourceforge.net/projects/apertium/files/apertium-sme-nob/
[2] http://wiki.apertium.org/wiki/HFST
[3] http://wiki.apertium.org/wiki/Apertium_and_Constraint_Grammar
[4] http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-sme-nob/paper/sme-nob.tex?content-type=text%2Fplain
[5] http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-sme-nob/paper/talk/smenob_talk.tex?content-type=text%2Fplain

-- 
Kevin Brubeck Unhammer
GPG: 0x766AC60C
