Re: [Apertium-stuff] extra ideas for GSOC: getting the ball rolling

Mikel L. Forcada Tue, 12 Mar 2013 10:19:54 -0700

Hi Apertiumers:
I have some more ideas to add to Fran's. There are a few more that I cannot
remember now, so I may come back with them later. My ideas are usually
outlandish and challenging, and Fran usually says they are not proper GSoC
projects, but hey, don't we want the best students?


(1) Sliding-window part-of-speech tagger. The idea is to implement the
unsupervised part-of-speech tagger (
http://en.wikipedia.org/wiki/Sliding_window_based_part-of-speech_tagging)
as a drop-in replacement for the current hidden-Markov-model tagger.
Ideally, it should have support for unknown words, and also for "forbid"
descriptions (not described in the paper). The tagger has a very intuitive
interpretation (believe me, even if you find the maths a bit daunting). I
am available for questions (I invented the tagger, I should be able to
remember!).
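
To make (1) a bit more concrete, here is a minimal Python sketch of the
estimation loop as I understand it from the Wikipedia article linked above.
It is only an illustration under simplifying assumptions, not the reference
implementation: the window is one ambiguity class on each side, and there is
no handling of unknown words or "forbid" descriptions. The names
train_sliding_window and tag are made up for this example.

```python
from collections import Counter, defaultdict


def train_sliding_window(classes, iterations=10):
    """Estimate effective tag counts for each (left, tag, right) context.

    classes: list of frozensets of tags, one ambiguity class per token.
    Assumption: the context is one ambiguity class on each side.
    """
    # Observed counts of (left class, current class, right class) windows.
    n = Counter(zip(classes, classes[1:], classes[2:]))

    # Start by sharing each observed window equally among its possible tags.
    ntilde = defaultdict(float)
    for (left, sigma, right), count in n.items():
        for t in sigma:
            ntilde[(left, t, right)] += count / len(sigma)

    # Re-distribute each window count in proportion to the current estimates.
    for _ in range(iterations):
        new = defaultdict(float)
        for (left, sigma, right), count in n.items():
            z = sum(ntilde[(left, t, right)] for t in sigma)
            for t in sigma:
                new[(left, t, right)] += count * ntilde[(left, t, right)] / z
        ntilde = new
    return ntilde


def tag(left, sigma, right, ntilde):
    """Pick the tag of class sigma with the largest effective count."""
    return max(sorted(sigma), key=lambda t: ntilde.get((left, t, right), 0.0))


# Toy corpus of ambiguity classes (hypothetical tags).
DET, N, V = frozenset({"det"}), frozenset({"n"}), frozenset({"v"})
NV = frozenset({"n", "v"})
model = train_sliding_window([DET, N, V, DET, NV, V])
best = tag(DET, NV, V, model)
```

In the toy corpus, the unambiguous noun seen between a determiner and a verb
pulls the ambiguous noun/verb word in the same context towards the noun
reading; that is the intuitive interpretation mentioned above.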

(2) Improving the web-based dictionary maintenance tool developed by Daniel
Torregrosa-Rivero (http://apertium.vm.bytemark.co.uk/simpledix/): create
configuration files for other language pairs and entry types, etc.  The
code is available at:
http://apertium.svn.sf.net/viewvc/apertium/trunk/apertium-simpledix/ . This
is related, I think, to Fran's 2). This is, to my knowledge, the most
promising alternative to editing XML .dix files directly to add simple
entries, but I might be wrong.

(3) A preprocessor or compiler to avoid having to write structural transfer
(i.e., .t1x, .t2x and .t3x) rules in raw XML, which is very explicit and
clear but clumsy and hard to write. Before Apertium, at interNOSTRUM.com we
had a language for .t1x-style files called MorphTrans, which is described in
the paper
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/download/3355/1843.
I believe this language is much easier to write; it should be upgraded
and documented. The preprocessor would read .mt1, .mt2, and .mt3 files in
MorphTrans-style format (with keywords in English) and generate the current
XML. There would also be the opposite tool (much easier to write as an XSLT
stylesheet) to generate MorphTrans-style code from current XML code.
MorphTrans can of course be redesigned a bit and, in fact, it should be.

(3') The same for .dix files. Two roundtrip converters to use the old
interNOSTRUM-style format (
http://www.sepln.org/revistaSEPLN/revista/25/25-Pag93.pdf), which is much
easier to write.

(4) One step beyond (3): a visual interface for writing structural transfer
rules. One would have to invent something, starting perhaps with a visual
rendering of block structure in the original XML language: how about
something like Scratch (http://scratch.mit.edu/), where jigsaw-puzzle-style
pieces only fit if the syntax is right...?

(5) Extending the .dix language (and modifying lt-proc or writing a
pre-processor for it) so that it can deal with the kind of stuff that some
people miss in the .dix (and .metadix) formats and that makes them use
HFST, which means that people have to mix two different dictionary formats
in the same language pair. And yes, of course, having something that
translates the current HFST format to the new superdix format. Yes, you
guessed it, I'd love to throw HFST overboard. I can tolerate it as a
temporary heresy to keep the church of Apertium together, but, as co-pope
[1], I'd like to canonicalize Apertium in the end. And it would be easier
to deal with prefixes, hey Jonathan?

(6) Tools to order .dixes and point out "bad coding style" (which would
have to be defined). My recollection is that the current .dix format is too
powerful and allows almost anything. I have to think more about this idea,
but I couldn't help throwing it out at you.
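
As a very rough illustration of the "ordering" half of (6), the following
Python sketch sorts the <e> entries of each <section> by their lm attribute.
It is only a sketch under assumptions: sort_sections is a made-up name,
entries without lm sort first, and the default ElementTree parser drops XML
comments, so a real tool would need to be more careful.

```python
import xml.etree.ElementTree as ET


def sort_sections(dix_xml: str) -> str:
    """Return the dictionary with each <section>'s children sorted by lemma."""
    root = ET.fromstring(dix_xml)
    for section in root.iter("section"):
        entries = list(section)
        # Case-insensitive sort on the lm attribute; a missing lm counts as "".
        entries.sort(key=lambda e: (e.get("lm") or "").lower())
        section[:] = entries
    return ET.tostring(root, encoding="unicode")


# Tiny hypothetical dictionary with one section.
example = (
    '<dictionary><section id="main" type="standard">'
    '<e lm="vreme"><i>vrem</i><par n="vrem/e__n"/></e>'
    '<e lm="timp"><i>timp</i><par n="timp__n"/></e>'
    '</section></dictionary>'
)
sorted_xml = sort_sections(example)
```

Checks for "bad coding style" could hang off the same loop once someone
defines what counts as bad style.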

I think that is enough for the moment, don't you folks think?

In connection with Fran's 3), one could perhaps take a look at Retratos
(http://sourceforge.net/projects/retratos/). I can talk to Helena de
Medeiros Caseli, who is the admin of that project.

Cheers

Mikel

[1] It does not matter who the real Apertium pope is. It's always going to
be "who's that guy in white next to Fran Tyers?".


2013/3/12 Francis Tyers <[email protected]>

> To try and get the ball rolling... we've got less than a week left ...
> Here are some ideas that I had for GSOC:
>
> 1) Combining Brill-tagger style transformation-based learning and
> Felipe's "supervised to unsupervised with fractional counts" to
> automatically generate constraint grammar (or constraint-grammar style)
> rules for morphological disambiguation. This would involve 1-way and
> n-way training (e.g. using >1 system to learn with).
>
> 2) An interface for working with .dix files. Not the typical "click here
> to add a word" interface, but something for more advanced users. The
> main interface would be a window with your corpus in. The corpus would
> be morphologically analysed with your .dix file and you would be able to
> see analysed words, and look up their paradigm(s), and for unanalysed
> word forms you would be given a drop down box with paradigms that match.
> Perhaps using something like[1]. Clicking on the paradigm in the
> dropdown would add an entry to the dictionary, recompile, recalculate
> coverage etc. You wouldn't be able to add new paradigms. There would
> also be an option, given a known lemma, to show all the forms that match
> surface forms of that lemma+paradigm in a concordance.  This would be
> written in python3 + gtk. Sentences in the corpus can be ordered by the
> combined frequency of their words, so you can see the sentences with the
> words which will improve your coverage best at the top.
>
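
The sentence ordering described at the end of 2) could be sketched in Python
roughly as follows. This is one possible reading, not Fran's design: it
assumes that only unanalysed word forms count towards the score, and the
names rank_sentences and known are invented for the example.

```python
from collections import Counter


def rank_sentences(sentences, known):
    """Sort tokenised sentences so the most coverage-improving come first.

    sentences: list of lists of word forms.
    known: set of word forms the analyser already covers.
    """
    # Corpus frequency of every word form that is not analysed yet.
    freq = Counter(w for s in sentences for w in s if w not in known)

    def score(sentence):
        # Combined frequency of the distinct unknown forms in the sentence.
        return sum(freq[w] for w in set(sentence) if w not in known)

    return sorted(sentences, key=score, reverse=True)
```

Sentences whose unknown words are frequent in the whole corpus float to the
top, so adding entries for them improves coverage fastest.
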
> 3) Improved bilingual dictionary induction. Use case: you have two
> morphological analysers, but no bilingual dictionary. But, you have a
> parallel corpus. For example: Romanian-French. You can analyse the
> corpus, and use some word-aligner (Giza++) to get word alignments, but
> you can't make the bidix entries directly from that. The user will have
> to specify models for bidix entries which map SL-paradigm : TL-paradigm.
> When building the bilingual dictionary, any alignment for which the SL
> word's paradigm doesn't have a template with the TL word's paradigm will
> be discarded. E.g.
>
> fr:
>     <e lm="temps"><i>temps</i><par n="mois__n"/></e>
> ro:
>     <e lm="timp" a="mioara"><i>timp</i><par n="timp__n"/></e>
>     <e lm="vreme" r="LR"><i>vrem</i><par n="vrem/e__n"/></e>
>
> Let's suppose we find in the alignments:
>
> temps:timp
> temps:vreme
>
> We will need patterns to match forms in mois__n to forms in timp__n and
> forms in mois__n to forms in vrem/e__n .
>
> There will be a script to extract the most frequent combinations of
> paradigms in SL-TL, so the user can prioritise which templates to make.
> So, generating the bidix would be done in an incremental fashion. A lot
> of the noise of the alignment process can be filtered out by disallowing
> combinations of words because of no existing paradigm-paradigm model
> (e.g. mois__n to cu__pr).
>
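
A minimal Python sketch of the filtering step described in 3), under stated
assumptions: alignments are already reduced to lemma pairs, each lemma maps
to a single paradigm, and templates is a plain set of (SL paradigm, TL
paradigm) pairs. The function name and data layout are hypothetical, and the
Romanian word "cu" is added here only to exercise the cu__pr case.

```python
from collections import Counter


def filter_alignments(alignments, sl_par, tl_par, templates):
    """Keep aligned pairs whose paradigm pair has a template; count the rest."""
    kept, missing = [], Counter()
    for sl, tl in alignments:
        pair = (sl_par.get(sl), tl_par.get(tl))
        if pair in templates:
            kept.append((sl, tl))
        else:
            # Discarded, but counted so frequent pairs can be prioritised.
            missing[pair] += 1
    return kept, missing


sl_par = {"temps": "mois__n"}
tl_par = {"timp": "timp__n", "vreme": "vrem/e__n", "cu": "cu__pr"}
templates = {("mois__n", "timp__n")}
alignments = [("temps", "timp"), ("temps", "vreme"), ("temps", "cu")]
kept, missing = filter_alignments(alignments, sl_par, tl_par, templates)
```

Calling missing.most_common() then gives the list of paradigm combinations
for which writing a template would pay off most, which matches the
incremental workflow described above.
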
> If anyone has any comments I'd love to hear them :)
>
> Fran
>
> 1. http://wiki.apertium.org/wiki/Improved_corpus-based_paradigm_matching
>


-- 
Mikel L. Forcada                    E-mail: [email protected]
Departament de Llenguatges          Phone: +34-96-590-9776
i Sistemes Informàtics                also +34-96-590-3772.
UNIVERSITAT D'ALACANT               Fax:   +34-96-590-9326, -3464
E-03071 ALACANT, Spain.

URL: http://www.dlsi.ua.es/~mlf

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff