To try to get the ball rolling... we've got less than a week left!
Here are some ideas that I had for GSoC:
1) Combining Brill-tagger-style transformation-based learning with
Felipe's "supervised to unsupervised with fractional counts" approach to
automatically generate constraint grammar (or constraint-grammar-style)
rules for morphological disambiguation. This would involve both 1-way and
n-way training (i.e. learning from the output of more than one system).
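As a rough illustration of the transformation-based part, here is a minimal sketch of Brill-style greedy rule learning over a toy tagged corpus. The tag names, the toy reference/baseline, and the rule shape (retag based on the previous tag) are all invented for illustration; a real system would learn constraint-grammar-style rules over ambiguity classes.

```python
# A minimal sketch of Brill-style transformation-based learning.
# Toy data: a disambiguated reference and a baseline tagging with one error.
from collections import Counter

reference = ["det", "n", "v", "det", "n"]
baseline  = ["det", "n", "n", "det", "n"]   # "v" mis-tagged as "n"

def candidate_rules(tags, reference):
    """Propose rules of the form (from_tag, to_tag, previous_tag)."""
    rules = Counter()
    for i, (got, want) in enumerate(zip(tags, reference)):
        if got != want and i > 0:
            rules[(got, want, tags[i - 1])] += 1
    return rules

def apply_rule(tags, rule):
    """Apply one rule left-to-right, returning a new tag sequence."""
    frm, to, prev = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == frm and out[i - 1] == prev:
            out[i] = to
    return out

def learn(tags, reference, max_rules=10):
    """Greedily pick the rule with the best net error reduction; repeat."""
    learned = []
    for _ in range(max_rules):
        best, best_gain = None, 0
        for rule in candidate_rules(tags, reference):
            fixed = apply_rule(tags, rule)
            gain = (sum(a != b for a, b in zip(tags, reference))
                    - sum(a != b for a, b in zip(fixed, reference)))
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:
            break
        learned.append(best)
        tags = apply_rule(tags, best)
    return learned, tags

rules, tagged = learn(baseline, reference)
```

The fractional-counts idea would replace the gold reference with expected counts, so the same greedy loop could run without fully disambiguated training data.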
2) An interface for working with .dix files. Not the typical "click here
to add a word" interface, but something for more advanced users. The
main interface would be a window containing your corpus. The corpus would
be morphologically analysed with your .dix file: you would be able to
see analysed words and look up their paradigm(s), and for unanalysed
word forms you would get a drop-down box of matching paradigms,
perhaps using something like [1]. Clicking on a paradigm in the
drop-down would add an entry to the dictionary, recompile, recalculate
coverage, etc. You wouldn't be able to add new paradigms. There would
also be an option, given a known lemma, to show in a concordance all the
surface forms matching that lemma+paradigm. This would be
written in Python 3 + GTK. Sentences in the corpus would be ordered by
the combined frequency of their words, so the sentences containing the
words that will improve your coverage most appear at the top.
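The paradigm-matching step for unanalysed forms could look something like the sketch below: match a surface form against the set of endings each paradigm generates, and offer the (paradigm, stem) candidates in the drop-down. The paradigm names and ending sets are invented for illustration; in practice they would be read from the compiled .dix.

```python
# A minimal sketch of matching an unanalysed surface form against
# paradigm suffix sets. Paradigm names and endings are invented.

# Each paradigm maps to the set of surface endings it generates.
paradigms = {
    "mois__n": {"", "s"},
    "chant/er__vblex": {"er", "e", "es", "ons", "ez", "ent"},
}

def matching_paradigms(form, paradigms):
    """Return (paradigm, stem) pairs whose endings can produce `form`."""
    matches = []
    for name, endings in paradigms.items():
        # Try longest endings first so we take the longest possible suffix.
        for ending in sorted(endings, key=len, reverse=True):
            if form.endswith(ending) and len(form) > len(ending):
                stem = form[: len(form) - len(ending)] if ending else form
                matches.append((name, stem))
                break
    return matches

candidates = matching_paradigms("parlons", paradigms)
```

Ranking the candidates (e.g. by how many other unanalysed forms each hypothesis would also explain) is exactly what [1] is about.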
3) Improved bilingual dictionary induction. Use case: you have two
morphological analysers but no bilingual dictionary; you do, however,
have a parallel corpus. For example: Romanian-French. You can analyse
the corpus and use a word aligner (e.g. GIZA++) to get word alignments,
but you can't make the bidix entries directly from that. The user will
have to specify models for bidix entries which map SL-paradigm :
TL-paradigm. When building the bilingual dictionary, any alignment for
which the SL word's paradigm doesn't have a template with the TL word's
paradigm will be discarded. E.g.
fr:
<e lm="temps"><i>temps</i><par n="mois__n"/></e>
ro:
<e lm="timp" a="mioara"><i>timp</i><par n="timp__n"/></e>
<e lm="vreme" r="LR"><i>vrem</i><par n="vrem/e__n"/></e>
Let's suppose we find in the alignments:
temps:timp
temps:vreme
We will need patterns to match forms in mois__n to forms in timp__n, and
forms in mois__n to forms in vrem/e__n.
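The discard step described above could be sketched as follows. The lemma-to-paradigm lookups and the template set are invented for illustration; in practice they would come from the two monodixes and from the user's template file.

```python
# A minimal sketch of the discard step: an alignment is kept only when
# the user has written a template for its SL/TL paradigm pair.

# Lemma -> paradigm, as read from the monolingual dictionaries.
sl_paradigms = {"temps": "mois__n", "avec": "cu__pr"}
tl_paradigms = {"timp": "timp__n", "vreme": "vrem/e__n", "cu": "cu__pr"}

# User-specified SL-paradigm : TL-paradigm templates.
templates = {("mois__n", "timp__n"), ("mois__n", "vrem/e__n")}

def keep(sl_lemma, tl_lemma):
    """Keep an alignment only if its paradigm pair has a template."""
    pair = (sl_paradigms.get(sl_lemma), tl_paradigms.get(tl_lemma))
    return pair in templates

# temps:timp and temps:vreme survive; avec:timp is discarded.
kept = [p for p in [("temps", "timp"), ("temps", "vreme"), ("avec", "timp")]
        if keep(*p)]
```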
There will be a script to extract the most frequent combinations of
SL-TL paradigms, so the user can prioritise which templates to make.
Generating the bidix would thus be done incrementally. Much of the
noise from the alignment process can be filtered out by disallowing
word combinations for which no paradigm-paradigm model exists
(e.g. mois__n to cu__pr).
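That extraction script could be as simple as the sketch below: map each aligned lemma pair to its paradigm pair and count. The lookup tables and alignment list are invented for illustration; real input would be lemmatised alignments from GIZA++ over the analysed corpus.

```python
# A minimal sketch of counting SL:TL paradigm pairs in word alignments,
# so the user can see which templates to write first.
from collections import Counter

# Lemma -> paradigm, as read from the two monodixes.
sl_paradigms = {"temps": "mois__n", "avec": "cu__pr"}
tl_paradigms = {"timp": "timp__n", "vreme": "vrem/e__n", "cu": "cu__pr"}

# Lemmatised word alignments from the parallel corpus.
alignments = [("temps", "timp"), ("temps", "timp"), ("temps", "vreme"),
              ("avec", "cu")]

pair_counts = Counter(
    (sl_paradigms[sl], tl_paradigms[tl])
    for sl, tl in alignments
    if sl in sl_paradigms and tl in tl_paradigms
)

# Print most frequent paradigm pairs first.
for (sl_par, tl_par), n in pair_counts.most_common():
    print(f"{n}\t{sl_par}:{tl_par}")
```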
If anyone has any comments I'd love to hear them :)
Fran
1. http://wiki.apertium.org/wiki/Improved_corpus-based_paradigm_matching
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff