Now I answer Jacob's comments.
Coool. A research project.
So this sounds exactly like an N-gram tagger, but you throw out the HMM and
throw in an FST?
Hmm, not really. See my comment below.
And you would be able to re-use the existing forbid rules we have now?
That's the idea. It takes a bit of thinking.
Right now we use a 2-gram tagger which, to me, using the wording of
your paper (http://www.dlsi.ua.es/~mlf/docum/sanchezvillamil04p.pdf),
sounds like 'left context size=1, right context size=0'
Not really. In a 2-gram HMM tagger, states are tags and probabilities
are for tag-to-tag transitions and tag-to-ambiguity class emissions. You
read ambiguity classes, and you build all possible state paths and score
them with a probability model. When you finish a sentence or, better,
when you hit a non-ambiguous word, you can select the path with the best
score (that is, the best tag sequence) and go on. If you hit a stretch
of seven ambiguous words, you don't make a decision until you hit a
non-ambiguous word. That is why HMMs can only be approximated but never
represented as finite-state machines.
In HMMs, forbid rules are zero transition probabilities for two tag
sequences, that is, forbidden 2-grams.
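To make the delayed decision concrete, here is a minimal Python sketch (the tags, probabilities and sentence are invented for illustration, and the emission model is trivialised); note how a forbid rule is simply a missing, i.e. zero, transition probability:

```python
def hmm_tag(ambiguity_classes, trans, emit=None):
    """Toy 2-gram HMM tagger over ambiguity classes (sets of candidate tags).

    trans[(t1, t2)]: tag-to-tag transition probability; a missing entry is
                     0.0, i.e. the 2-gram "t1 t2" is forbidden.
    emit[(t, ac)]:   tag-to-ambiguity-class emission probability; missing
                     entries default to 1.0 in this sketch.
    """
    emit = emit or {}
    # best[t] = (score of best path ending in tag t, that path)
    best = {t: (1.0, [t]) for t in ambiguity_classes[0]}
    output = []
    for ac in ambiguity_classes[1:]:
        key = frozenset(ac)
        best = {
            t: max(
                ((s * trans.get((prev, t), 0.0) * emit.get((t, key), 1.0),
                  path + [t])
                 for prev, (s, path) in best.items()),
                key=lambda sp: sp[0],
            )
            for t in ac
        }
        if len(ac) == 1:              # non-ambiguous word: decision forced
            ((_, path),) = best.values()
            output.extend(path)       # flush the single surviving path
            best = {path[-1]: (1.0, [])}   # restart from the fixed tag
    # flush whatever is still pending at sentence end
    output.extend(max(best.values(), key=lambda sp: sp[0])[1])
    return output
```

In a stretch of seven ambiguous words, all competing paths stay alive until the next non-ambiguous word (or the sentence end) forces the flush.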
In SWPoST (sliding-window part-of-speech taggers), decisions are not
delayed. For instance, with 'left context size=1, right context
size=1', once a 3-gram of _ambiguity classes_ is detected, the word in
the middle is assigned a tag. There is no need to wait for non-ambiguous
words to make a decision.
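By contrast with the HMM, the sliding-window decision can be sketched as a plain table lookup over 3-grams of ambiguity classes (the rule table and boundary marker below are hypothetical):

```python
def swpost_tag(ambiguity_classes, rules):
    """Toy sliding-window tagger, left and right context size 1.

    Each word gets a tag from its own ambiguity class by looking only at
    the classes of its immediate neighbours; no decision is ever delayed.
    rules[(left, mid, right)] -> tag chosen for the middle word.
    """
    BOUND = frozenset({'#'})          # sentence-boundary pseudo-class
    padded = [BOUND] + [frozenset(ac) for ac in ambiguity_classes] + [BOUND]
    tags = []
    for left, mid, right in zip(padded, padded[1:], padded[2:]):
        if len(mid) == 1:             # non-ambiguous: nothing to decide
            tags.append(next(iter(mid)))
        else:                         # class 3-gram seen -> tag, immediately
            tags.append(rules[(left, mid, right)])
    return tags
```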
Forbids could be applied before training (so that the wrong tag
sequences are never seen in training) or after (they would work nicely
when the context contains non-ambiguous words). I have to think about it.
I have tried using the 3-gram HMM tagger (i.e. 'left context size=2,
right context size=0'?) made by a GSoC student 3 years ago, but the
results weren't better than the 2-gram for English->Esperanto. So I
stayed with the (not very good) 2-gram tagger.
What you call the 2-gram HMM tagger is more powerful than the 'left
context size=1, right context size=0' sliding-window tagger.
One of the problems of a 3-gram tagger is that the number of
parameters to estimate in the HMM explodes. Processing gets slower and
you need a much bigger training set for unsupervised tagger training.
The number of parameters of the SWPoST is large but, as they may be
encoded as a series of decision rules, the model may be made very
compact. And, by the way, Fran, it would be very similar to a Brill
tagger, and could be integrated.
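A back-of-the-envelope comparison of parameter counts illustrates both the 3-gram explosion and the size of the raw SWPoST table before any compaction into decision rules. The sizes below are invented, and these are loose upper bounds:

```python
# T = number of tags, A = number of ambiguity classes (invented sizes).
T, A = 50, 300

bigram_hmm = T**2 + T * A    # tag->tag transitions + class emissions
trigram_hmm = T**3 + T * A   # tag-pair->tag transitions + class emissions
swpost_1_1 = A**3 * T        # one score per (left, mid, right) class
                             # 3-gram and candidate tag, before pruning
```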
At the end of the article you write: 'We are currently studying ways to
reduce further the number of states and transitions at a small price in
tagging accuracy, by using probabilistic criteria to prune uncommon
contexts which do not contribute significantly to the overall accuracy'
So.... did you find out anything? ;-)
No. The first author worked for me on a project. He made some decisions
about copyright and licensing that I hadn't approved. I told him this
was unacceptable; he got angry and decided he didn't want to work with
me anymore. The result: the work was abandoned, and the code is nowhere
to be found.
> (3) A preprocessor or compiler to avoid having to write structural transfer
[...]
si ( ( ( ($1.orig.gen!="<mf>")
         i ($2.orig.gen!="<mf>")
         i ($1.orig.gen==$2.orig.gen) )
       o ( ($1.orig.gen=="<mf>")
           o ($2.orig.gen=="<mf>") ) )
     i ( ( ($1.orig.nbr!="<sp>")
           i ($2.orig.nbr!="<sp>")
           i ($1.orig.nbr==$2.orig.nbr) )
         o ( ($1.orig.nbr=="<sp>")
             o ($2.orig.nbr=="<sp>") ) ) )
is indeed much easier to understand than our current transfer XML
format.
Easier to write too.
And yes, this would be a fine project I think. One of the challenges
would be writing the validator though.
No validator needed. As I mentioned for dix files, one needs to write a
proper compiler.
As far as I understand there is a 1-to-1 mapping to our current
transfer XML format, so validation could be done by using the
converter and then validating the XML file.
The converter (compiler) would choke on errors. The trip back would be
validation followed by XSLT transformation.
It will probably be hard to convert all facets of the .dix files into
this format, so a round trip would lose information in some cases.
imho, this is basically reinventing lexc -- were there validators
available for it?
As far as I understand there is a mapping to our current transfer XML
format, so validation could be done by using the converter and then
validating the XML file.
Vide supra.
> (6) Tools to order .dixes and point at "bad coding style" (which
> would have to be defined). My impression is that the current .dix
> format is too powerful and allows almost anything. I have to think
> more about this idea, but I couldn't help throwing it out at you.
We have the idea "lint for Apertium", which is quite similar to this
one.
http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/lint_for_Apertium
Yes, this is 'lint' again. Great idea.
Much more than lint, as I told Fran.
Cheers
Mikel
--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff