I was about to release apertium-sme-nob this spring, but then
FreeRBMT12 and everything happened … anyway, the North Sámi to Norwegian
Bokmål language pair is now officially released, with version 0.5.0
uploaded to SourceForge[1]. Apart from the regular prerequisites, it
requires HFST[2] and vislcg3[3] to be installed (see the README).

The language pair is mostly based on resources made by the Giellatekno
group at the University of Tromsø (this includes the sme
analyser+disambiguation and most of the nob-sme dictionary), and quite a
lot of people from that group have worked on it directly or indirectly,
including Berit N. Eskonsipo, Francis Tyers, Lene Antonsen, Linda
Wiechetek, Ritva Nystad, Sjur N. Moshagen and Trond Trosterud.

The bilingual dictionary has 21243 entries (plus 43890 proper nouns).
The sme monolingual dictionary _should_ be auto-trimmed to those entries
(it seemed testvoc clean, but that's currently a bit hard to test with
HFST analysers). The nob dictionary is copied over from nn-nb and is
generation-only, so it is currently untrimmed, with 98427 entries.

The Giellatekno disambiguation Constraint Grammar has around 3700 rules.
There's also a lexical selection CG, created especially for this language
pair, with 102 rules.

Transfer is four-stage:
1. 63 chunking rules
2. 26 coordination/postposition-cleanup rules
3. 39 word order + pro-drop rules
4. 29 trivial cleanup rules

Coverage and ambiguity rate (number of analyses per token) over some
corpora, with and without derivational morphology turned on to see its
effect on coverage:

| Corpus |  tokens |           coverage |           ambig.rate |
|--------+---------+--------------------+----------------------|
| laws   |   51706 |             94.68% |                 2.65 |
| wiki   |   19942 |             77.52% |                 2.36 |
| news   | 1020250 |             94.72% |                 2.59 |

| Corpus |  tokens | coverage w/o deriv | ambig.rate w/o deriv |
|--------+---------+--------------------+----------------------|
| laws   |   51706 |             86.02% |                 2.32 |
| wiki   |   19942 |             74.56% |                 2.19 |
| news   | 1020250 |             90.96% |                 2.34 |
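
For anyone wanting to compute similar figures on their own corpus, here is
a minimal sketch of how coverage and ambiguity rate can be derived from
analyser output in Apertium stream format (^surface/reading1/reading2$,
with unknown words marked as ^surface/*surface$). It is not the script
used for the tables above. The function name and the sample readings are
made up for illustration, the ambiguity rate is assumed to be averaged
over known tokens only, and escaped characters inside lexical units are
not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/reading1/reading2$
# Assumption: no escaped "$" or "/" inside a unit (a simplification).
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage_and_ambiguity(analysed_text):
    """Return (tokens, coverage, ambiguity rate) for analysed text.

    Coverage is the share of tokens with at least one real analysis.
    The ambiguity rate is the mean number of analyses per known token
    (an assumption; the figures in the post may be computed differently).
    """
    tokens = 0
    known = 0
    analyses = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        _surface, *readings = match.group(1).split("/")
        tokens += 1
        # An unknown word has a single reading that starts with "*".
        if readings and not readings[0].startswith("*"):
            known += 1
            analyses += len(readings)
    coverage = known / tokens if tokens else 0.0
    ambiguity = analyses / known if known else 0.0
    return tokens, coverage, ambiguity


# Made-up sample: two known tokens (one of them ambiguous), one unknown.
sample = "^aaa/aaa<n><sg><nom>/aaa<v><inf>$ ^bbb/bbb<adv>$ ^ccc/*ccc$"
tokens, coverage, ambiguity = coverage_and_ambiguity(sample)
print(f"{tokens} tokens, coverage {coverage:.2%}, ambig.rate {ambiguity:.2f}")
```

Running this once on output from the full analyser and once on output
from an analyser built without derivations would give the two halves of
the comparison.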

WER (word error rate) results are around 50%, i.e. not really useful for
post-editing, but this is a long-distance language pair meant for
gisting / "MT for understanding" (it is also from a minority to a
majority language, a similar situation to eu→es I guess). There is a
paper[4] and a talk[5] presented at FreeRBMT12 describing the language
pair in more detail and showing some fairly good evaluation results with
respect to gisting. Also, I find it very useful for gisting :-)
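
As a reminder of what a WER figure measures, here is a minimal sketch of
the standard definition: the word-level edit distance between the MT
output and a reference (for post-editing evaluation, typically the
post-edited text), divided by the reference length. This is not the
evaluation script behind the number above; the function name and the
whitespace tokenisation are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref) if ref else 0.0


# One substituted word and one missing word against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

At a WER of about 50%, roughly every second word of the output would
need an edit to match the reference, which is why post-editing is not
the target use here.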

Thanks to everyone involved in working on this project, and to
Giellatekno for paying me to do what I like best :-) Regarding future
work, some main goals are increasing coverage both of bidix and of
chunking rules, as well as improving disambiguation. There's always
stuff to do …





[1]  https://sourceforge.net/projects/apertium/files/apertium-sme-nob/

[2]  http://wiki.apertium.org/wiki/HFST

[3]  http://wiki.apertium.org/wiki/Apertium_and_Constraint_Grammar

[4]  http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-sme-nob/paper/sme-nob.tex?content-type=text%2Fplain

[5]  http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/apertium-sme-nob/paper/talk/smenob_talk.tex?content-type=text%2Fplain


-- 
Kevin Brubeck Unhammer

GPG: 0x766AC60C


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
