Thanks to both of you for some great ideas!!

2013/3/12 Francis Tyers <[email protected]>

> On Tue, 12 Mar 2013 at 18:19 +0100, Mikel L. Forcada
> wrote:
>
> > (1) Sliding-window part-of-speech tagger. The idea is to implement the
> > unsupervised part-of-speech tagger
> > (
> http://en.wikipedia.org/wiki/Sliding_window_based_part-of-speech_tagging)
> as a drop-in replacement for the current hidden-Markov-model tagger.
> Ideally, it should have support for unknown words, and also for "forbid"
> descriptions (not described in the paper). The tagger has a very intuitive
> interpretation (believe me, even if you find the maths a bit daunting). I
> am available for questions (I invented the tagger, I should be able to
> remember!).
>
> I think this would make a great project, we really need improved
> morphological disambiguation in canonical Apertium.  It's particularly
> nice in that it can be represented as an FST (hopefully with
> lttoolbox).
>
> I'm not sure I quite understand the results in the paper, though. The
> performance of the tagger was better than a bigram HMM trained with
> Baum-Welch, but it still had a ~35% error rate?
>

Coool. A research project.
So this sounds exactly like an N-gram tagger, but you throw out the HMM and
throw in an FST?
And would you be able to re-use the existing forbid rules we have now?

Right now we use a 2-gram tagger, which, in the wording of your paper
(http://www.dlsi.ua.es/~mlf/docum/sanchezvillamil04p.pdf), sounds like
'left context size=1, right context size=0'.

I have tried the 3-gram HMM tagger (i.e. 'left context size=2, right
context size=0'?) made by a GSoC student 3 years ago, but the results
weren't better than the 2-gram for English->Esperanto, so I stayed with the
(not very good) 2-gram tagger.

One of the problems of a 3-gram tagger is that the number of parameters in
the HMM explodes: processing gets slower and you need a much bigger training
set for unsupervised tagger training.
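To check my own understanding of the sliding-window idea, here is a toy Python sketch (mine, not the algorithm from the paper — the paper trains unsupervised over ambiguity classes, while this just counts windows in a tagged corpus):

```python
# Toy sliding-window disambiguator: pick, for each ambiguous word, the tag
# seen most often in the same (left tags, right tags) window during training.
from collections import Counter

def train(tagged_sentences, left=1, right=1):
    """Count how often each tag occurs in each (left, right) tag window."""
    counts = Counter()
    for sent in tagged_sentences:  # sent: list of (word, tag) pairs
        tags = ["#"] * left + [t for _, t in sent] + ["#"] * right
        for i, (_, tag) in enumerate(sent):
            ctx = (tuple(tags[i:i + left]),
                   tuple(tags[i + left + 1:i + left + 1 + right]))
            counts[(ctx, tag)] += 1
    return counts

def disambiguate(words, counts, left=1, right=1):
    """words: list of (word, candidate tags). Approximate the context with
    the unambiguous words ('#' elsewhere) and pick the most frequent tag."""
    approx = [c[0] if len(c) == 1 else "#" for _, c in words]
    padded = ["#"] * left + approx + ["#"] * right
    chosen = []
    for i, (_, cands) in enumerate(words):
        ctx = (tuple(padded[i:i + left]),
               tuple(padded[i + left + 1:i + left + 1 + right]))
        chosen.append(max(cands, key=lambda t: counts[(ctx, t)]))
    return chosen
```

With left=1, right=0 this degenerates into something like our current 2-gram setup, which is what made me curious about the right-context part.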

At the end of the article you write: 'We are currently studying ways to
reduce further the number of states and transitions at a small price in
tagging accuracy, by using probabilistic criteria to prune uncommon contexts
which do not contribute significantly to the overall accuracy.'

So.... did you find out anything?  ;-)

>
> > (3) A preprocessor or compiler to avoid having to write structural
> > transfer (i.e., .t1x, .t2x and .t3x) rules in raw XML which is very
> > overt and clear, but clumsy and hard to write. Before Apertium, in
> > interNOSTRUM.com we had a language for .t1x-style files called
> > MorphTrans, which is described in the
> > paper
> http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/download/3355/1843.
>  I believe this language is much easier to write; it should be upgraded
> and documented. The preprocessor would read .mt1, .mt2, and .mt3 files in
> MorphTrans-style format (with keywords in English) and generate the current
> XML. There would also be the opposite tool (much easier to write as an XSLT
> stylesheet) to generate MorphTrans-style code from current XML code.
> Morphtrans can of course be redesigned a bit, and, in fact, it should.
>
> I love the "si ... altrament" :D
>

I like it as well, and I think

si ( ( ( ($1.orig.gen!="<mf>")
        i($2.orig.gen!="<mf>")
        i($1.orig.gen==$2.orig.gen) )
      o( ($1.orig.gen=="<mf>")
        o($2.orig.gen=="<mf>") ) )
   i ( ( ($1.orig.nbr!="<sp>")
        i($2.orig.nbr!="<sp>")
        i($1.orig.nbr==$2.orig.nbr) )
      o( ($1.orig.nbr=="<sp>")
        o($2.orig.nbr=="<sp>") ) ) )

is indeed much easier to understand than our current transfer XML format.
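For anyone who hasn't seen the notation: si/i/o read as if/and/or, and the condition is just gender and number agreement, with "<mf>" and "<sp>" as 'either gender' / 'either number' wildcards. My reading of it in Python (hand-translated, not output of any tool):

```python
# Gender agreement: both sides carry the same concrete gender, or either
# side carries the "<mf>" wildcard.
def genders_agree(g1, g2):
    return (g1 != "<mf>" and g2 != "<mf>" and g1 == g2) \
        or g1 == "<mf>" or g2 == "<mf>"

# Number agreement: same idea with the "<sp>" wildcard.
def numbers_agree(n1, n2):
    return (n1 != "<sp>" and n2 != "<sp>" and n1 == n2) \
        or n1 == "<sp>" or n2 == "<sp>"

# lu = (gender tag, number tag) of a lexical unit
def agree(lu1, lu2):
    return genders_agree(lu1[0], lu2[0]) and numbers_agree(lu1[1], lu2[1])
```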



>
> And yes, this would be a fine project I think. One of the challenges
> would be writing the validator though.
>

As far as I understand there is a 1-to-1 mapping to our current transfer
XML format, so validation could be done by running the converter and then
validating the XML file.
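Roughly like this (convert_to_xml is a stand-in for the converter, which doesn't exist yet; a real check would also validate against transfer.dtd, e.g. with xmllint --dtdvalid, rather than only checking well-formedness):

```python
# Sketch of "validate by converting": turn the MorphTrans-style source into
# transfer XML, then let an XML parser reject anything malformed.
import xml.etree.ElementTree as ET

def convert_to_xml(mt_source):
    # Stand-in for the hypothetical MorphTrans -> .t1x converter; a real
    # one would translate si/i/o conditions into <test>/<and>/<or> etc.
    return "<rule/>"

def is_valid_transfer_xml(xml_text):
    # First gate: well-formedness; DTD validation would come next.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```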





>
> > (3') The same for .dix files. Two roundtrip converters to use the old
> > interNOSTRUM-style format
> > (http://www.sepln.org/revistaSEPLN/revista/25/25-Pag93.pdf), which is
> > much easier to write.
>

It will probably be hard to convert all facets of the .dix files into this
format, so a round trip would lose information in some cases.


>
> imho, this is basically reinventing lexc -- were there validators
> available for it ?
>

As far as I understand there would again be a mapping to the current XML
format, so validation could be done by running the converter and then
validating the generated XML file.



>
> > (6) Tools to order .dixes and point at "bad coding style" (which would
> > have to be defined). My conviction is that the current .dix format is
> > too powerful and allows almost anything. I have to think more about
> > this idea, but I couldn't help throwing it out at you.
>
> We have the idea "lint for Apertium " which is quite similar to this
> one.
>
>
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/lint_for_Apertium
>
>
Yes, this is 'lint' again. Great idea.


-- 
Jacob Nordfalk <http://profiles.google.com/jacob.nordfalk>
javabog.dk
Android developer and instructor at
DTU <http://cv.ihk.dk/diplomuddannelser/itd/vf/MAU> and
Lund&Bendsen <https://www.lundogbendsen.dk/undervisning/beskrivelse/LB1809/>
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
