Now I answer Jacob's comments.
Coool. A research project.
So this sounds exactly like an N-gram tagger, but you throw out the HMM and throw in an FST?
Hmm, not really. See my comment below.
And you would be able to re-use the existing forbid rules we have now?
That's the idea. It takes a bit of thinking.

Right now we use a 2-gram tagger which, to me, using the wording of your paper (http://www.dlsi.ua.es/~mlf/docum/sanchezvillamil04p.pdf), sounds like 'left context size=1, right context size=0'.
Not really. In a 2-gram HMM tagger, states are tags and probabilities are for tag-to-tag transitions and tag-to-ambiguity class emissions. You read ambiguity classes, and you build all possible state paths and score them with a probability model. When you finish a sentence or, better, when you hit a non-ambiguous word, you can select the path with the best score (that is, the best tag sequence) and go on. If you hit a stretch of seven ambiguous words, you don't make a decision until you hit a non-ambiguous word. That is why HMMs can only be approximated but never represented as finite-state machines.
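To make the delayed decision concrete, here is a toy sketch in Python. Everything in it is invented for illustration (the tags, the ambiguity classes, the probabilities), and it enumerates all paths instead of using the usual Viterbi recursion, but the behaviour is the one described above:

```python
from itertools import product

# Invented example sentence "the duck swims":
# 'duck' is noun/verb ambiguous, the other words are not.
amb = [{"det"}, {"n", "vblex"}, {"vblex"}]

trans = {  # invented tag-to-tag transition probabilities
    ("det", "n"): 0.9, ("det", "vblex"): 0.1,
    ("n", "vblex"): 0.8, ("vblex", "vblex"): 0.1,
}

def best_path(amb_classes, trans):
    """Score every possible tag sequence and return the best one.

    A real tagger uses the Viterbi recursion instead of brute force,
    but the point is the same: nothing can be decided until the
    ambiguous stretch is closed by an unambiguous word."""
    def score(path):
        p = 1.0
        for a, b in zip(path, path[1:]):
            p *= trans.get((a, b), 0.0)
        return p
    return max(product(*amb_classes), key=score)

print(best_path(amb, trans))  # ('det', 'n', 'vblex')
```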

In HMMs, forbid rules are zero transition probabilities for two-tag sequences, that is, forbidden 2-grams.

In SWPoST (sliding-window part-of-speech taggers), decisions are not delayed. For instance, with 'left context size=1, right context size=1', once a 3-gram of _ambiguity classes_ is detected, the word in the middle is assigned a tag. No need to wait for non-ambiguous words to make a decision.
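As a contrast with the HMM case, here is a toy sketch of the sliding-window idea. The rule table below is invented for illustration; a real SWPoST learns it from a corpus:

```python
# Invented rule table: a trigram of ambiguity classes maps directly
# to a tag for the middle word, so no decision is ever delayed.
rules = {
    (frozenset({"det"}), frozenset({"n", "vblex"}), frozenset({"vblex"})): "n",
}

def tag_middle(left, middle, right, rules):
    """Assign a tag to the middle word immediately from its context."""
    key = (frozenset(left), frozenset(middle), frozenset(right))
    # Fall back to an arbitrary tag of the class if no rule applies.
    return rules.get(key, next(iter(middle)))

print(tag_middle({"det"}, {"n", "vblex"}, {"vblex"}, rules))  # n
```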

Forbids could be applied before training (so that forbidden tag sequences are never seen in training) or after it (they would work nicely when the context contains non-ambiguous words). I have to think about it.


I have tried using the 3-gram HMM tagger (i.e. 'left context size=2, right context size=0'?) made by a GSoC student three years ago, but the results weren't better than the 2-gram's for English->Esperanto. So I stayed with the (not very good) 2-gram tagger.

What you call the 2-gram HMM tagger is more powerful than the 'left context size=1, right context size=0' sliding-window tagger.
One of the problems of a 3-gram tagger is that the number of parameters in the HMM explodes. Processing gets slower and you need a much bigger training set for unsupervised tagger training.
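A back-of-the-envelope count illustrates the explosion (the tagset and ambiguity-class sizes here are invented round numbers, not Apertium's):

```python
T = 100  # invented number of tags
A = 300  # invented number of ambiguity classes

# An n-gram HMM needs roughly |T|**n tag-transition parameters
# plus |T| * |A| tag-to-ambiguity-class emission parameters.
bigram = T**2 + T * A
trigram = T**3 + T * A
print(bigram, trigram)  # 40000 1030000
```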
The number of parameters of the SWPoST is large but, as it can be encoded as a series of decision rules, it can be made very compact. And, by the way, Fran, it would be very similar to a Brill tagger, and could be integrated.

At the end of the article you write: 'We are currently studying ways to reduce further the number of states and transitions at a small price in tagging accuracy, by using probabilistic criteria to prune uncommon contexts which do not contribute significantly to the overall accuracy.'

So.... did you find out anything?  ;-)
No. The first author worked for me on a project. He took some decisions about copyright and licensing that I hadn't approved. I told him this was unacceptable; he got angry and decided he didn't want to work with me anymore. The result: the work was abandoned, and the code is nowhere to be found.





    > (3) A preprocessor or compiler to avoid having to write structural
    > transfer


[...]


    si ( ( ( ($1.orig.gen!="<mf>")
            i($2.orig.gen!="<mf>")
            i($1.orig.gen==$2.orig.gen) )
          o( ($1.orig.gen=="<mf>")
            o($2.orig.gen=="<mf>") ) )
       i ( ( ($1.orig.nbr!="<sp>")
            i($2.orig.nbr!="<sp>")
            i($1.orig.nbr==$2.orig.nbr) )
          o( ($1.orig.nbr=="<sp>")
            o($2.orig.nbr=="<sp>") ) ) )

    is indeed much easier to understand than our current transfer XML
    format.

Easier to write too.


    And yes, this would be a fine project I think. One of the challenges
    would be writing the validator though.

No validator needed. As I mentioned for dix files, one needs to write a proper compiler.

As far as I understand there is a 1-to-1 mapping to our current transfer XML format, so validation could be done by running the converter and then validating the XML file.
The converter (compiler) would choke on errors. The trip back would be validation followed by XSLT transformation.

It will probably be hard to convert all facets of the .dix files into this format, so a round trip would lose information in some cases.


    imho, this is basically reinventing lexc -- were there validators
    available for it ?


See above.




    > (6) Tools to order .dixes and point at "bad coding style" (which
    > would have to be defined). My impression is that the current .dix
    > format is too powerful and allows almost anything. I have to think
    > more about this idea, but I couldn't help throwing it out at you.

    We have the idea "lint for Apertium", which is quite similar to this
    one.

    
    http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/lint_for_Apertium


Yes, this is 'lint' again. Great idea.
Much more than lint, as I told Fran.

Cheers

Mikel

--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes InformĂ tics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
