On Wed, 19 Sep 2012 at 17:56 +0200, Kevin Brubeck Unhammer
wrote:
> I was about to do the release of apertium-sme-nob this spring, but then
> FreeRBMT12 and everything happened … anyway, the North Sámi to Norwegian
> Bokmål language pair is now officially released, with version 0.5.0
> uploaded to SourceForge[1]. Apart from the regular prerequisites, it
> requires HFST[2] and vislcg3[3] installed (see the README).
> 
> The language pair is mostly based on resources made by the Giellatekno
> group at the University of Tromsø (this includes the sme
> analyser+disambiguation and most of the nob-sme dictionary), and quite a
> lot of people from that group have worked on it directly or indirectly,
> including Berit N. Eskonsipo, Francis Tyers, Lene Antonsen, Linda
> Wiechetek, Ritva Nystad, Sjur N. Moshagen and Trond Trosterud.
> 
> The bilingual dictionary has 21243 entries (plus 43890 proper nouns),
> the sme monolingual _should_ be auto-trimmed to those entries (it seemed
> testvoc clean, but that's currently a bit hard to test with HFST
> analysers). The nob dictionary is copied over from nn-nb and is
> generation-only, so currently untrimmed with 98427 entries.
> 
> The Giellatekno disambiguation Constraint Grammar has around 3700 rules,
> there's also a lexical selection CG especially created for this language
> pair with 102 rules.
> 
> Transfer is four-stage:
> 1. 63 chunking rules
> 2. 26 coordination/postposition-cleanup rules
> 3. 39 word order + pro-drop rules
> 4. 29 trivial cleanup rules
> 
> Coverage and ambiguity rate (number of analyses per token) over some
> corpora, with and without derivational morphology turned on to see its
> effect on coverage:
> | Corpus |  tokens |           coverage |           ambig.rate |
> |--------+---------+--------------------+----------------------|
> | laws   |   51706 |             94.68% |                 2.65 |
> | wiki   |   19942 |             77.52% |                 2.36 |
> | news   | 1020250 |             94.72% |                 2.59 |
> |--------+---------+--------------------+----------------------|
> | Corpus |  tokens | coverage w/o deriv | ambig.rate w/o deriv |
> |--------+---------+--------------------+----------------------|
> | laws   |   51706 |             86.02% |                 2.32 |
> | wiki   |   19942 |             74.56% |                 2.19 |
> | news   | 1020250 |             90.96% |                 2.34 |
> 
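
In case anyone wants to compute similar figures on their own corpora, here is a rough sketch, not the script used for the table above. It assumes the analyser output is in Apertium stream format (^surface/reading1/reading2$, with unknown words marked as ^surface/*surface$). It also assumes that an unknown token counts as one reading when averaging, which may differ from how the quoted numbers were obtained. The function name and the sample string are made up for illustration, and escaped characters in the stream are not handled.

```python
import re

# One lexical unit in Apertium stream format: ^surface/reading1/reading2$
# Simplification: escaped "^", "$" and "/" inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage_and_ambiguity(analysed_text):
    """Return (coverage, readings per token) for analyser output.

    Assumption: an unknown token (reading starts with "*") counts as
    one reading when computing the ambiguity rate.
    """
    tokens = 0
    known = 0
    readings_total = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        surface, *readings = match.group(1).split("/")
        tokens += 1
        if readings and not readings[0].startswith("*"):
            known += 1
            readings_total += len(readings)
        else:
            readings_total += 1
    if tokens == 0:
        raise ValueError("no lexical units found in input")
    return known / tokens, readings_total / tokens


if __name__ == "__main__":
    # Invented sample: one ambiguous token, one unambiguous, one unknown.
    sample = "^a/a<n>/a<vblex>$ ^b/b<n>$ ^c/*c$"
    coverage, ambiguity = coverage_and_ambiguity(sample)
    print(f"coverage: {coverage:.2%}, ambig. rate: {ambiguity:.2f}")
```
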
> WER results are around 50%, i.e. not really useful for post-editing, but
> this is a long-distance language pair meant for gisting / "MT for
> understanding" (it is also from minority to majority language, a similar
> situation to eu→es I guess). There is a paper[4] and talk[5] presented
> at FreeRBMT12 describing the language pair in more detail and showing
> some fairly good evaluation results with respect to gisting. Also, I
> find it very useful for gisting :-)
> 
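
For anyone unfamiliar with the metric: WER (word error rate) is usually taken to be the word-level edit distance between the MT output and a reference translation, divided by the number of words in the reference. A minimal sketch of that standard definition follows. It assumes plain whitespace tokenisation and is not necessarily how the evaluation in the paper was set up; the function name is my own.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the ref words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c"))
```

A WER of 0.5 means that about half of the reference words would need to be substituted, inserted or deleted to fix the output.
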
> Thanks to everyone involved in working on this project, and to
> Giellatekno for paying me to do what I like best :-) Regarding future
> work, some main goals are increasing coverage both of bidix and of
> chunking rules, as well as improving disambiguation. There's always
> stuff to do …

Fantastic work Unhammer! And all involved! :D It's great to finally get
this released!

Regarding the "not useful for post-editing" point: this could be one of
the pairs to include in the "acat" tool.

For an example of the output, take this article from Ávvir today:

http://www.avvir.no/vivvo_general/2343.html

What on earth is it about? Well, Norwegian speakers can pass it through
sme-nob and get what looks to me like a fairly serviceable gisting
translation. And if you don't speak Norwegian, you can try the
nursery/apertium-no-en output below (not bad either).

  http://pastebin.com/HHiMjCez

I think this is our "heaviest" released language pair yet -- in terms of
prerequisites (don't try to compile unless you have more than 2 GB of
RAM) and number of rules -- and it really stretches the boundaries of
what can be done with Apertium.

This shows that the prospects of using Apertium to make gisting
translators for other Uralic languages are probably pretty good too.

Fran


_______________________________________________
Apertium-stuff mailing list
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
