Now I answer Jacob's comments.
Coool. A research project.
So this sounds exactly like an N-gram tagger, but you throw out the HMM and
throw in an FST?
Hmm, not really. See my comment below.
And you would be able to re-use the existing forbid rules we have now?
That's the idea. It takes a bit of thinking.
Right now we use a 2-gram tagger which, to me, using the wording of
your paper (http://www.dlsi.ua.es/~mlf/docum/sanchezvillamil04p.pdf),
sounds like 'left context size=1, right context size=0'
Not really. In a 2-gram HMM tagger, states are tags and probabilities
are for tag-to-tag transitions and tag-to-ambiguity class emissions. You
read ambiguity classes, and you build all possible state paths and score
them with a probability model. When you finish a sentence or, better,
when you hit a non-ambiguous word, you can select the path with the best
score (that is, the best tag sequence) and go on. If you hit a stretch
of seven ambiguous words, you don't make a decision until you hit a
non-ambiguous word. That is why HMMs can only be approximated but never
represented as finite-state machines.
In HMMs, forbid rules are zero transition probabilities for two tag
sequences, that is, forbidden 2-grams.
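To make the delayed decision concrete, here is a minimal Python sketch (the tags, probabilities and sentence are invented for illustration, and the emission model is trivialised); note how a forbid rule is simply a missing, i.e. zero, transition probability:

```python
def hmm_tag(ambiguity_classes, trans, emit=None):
    """Toy 2-gram HMM tagger over ambiguity classes (sets of candidate tags).

    trans[(t1, t2)]: tag-to-tag transition probability; a missing entry is
                     0.0, i.e. the 2-gram "t1 t2" is forbidden.
    emit[(t, ac)]:   tag-to-ambiguity-class emission probability; missing
                     entries default to 1.0 in this sketch.
    """
    emit = emit or {}
    # best[t] = (score of best path ending in tag t, that path)
    best = {t: (1.0, [t]) for t in ambiguity_classes[0]}
    output = []
    for ac in ambiguity_classes[1:]:
        key = frozenset(ac)
        best = {
            t: max(
                ((s * trans.get((prev, t), 0.0) * emit.get((t, key), 1.0),
                  path + [t])
                 for prev, (s, path) in best.items()),
                key=lambda sp: sp[0],
            )
            for t in ac
        }
        if len(ac) == 1:              # non-ambiguous word: decision forced
            ((_, path),) = best.values()
            output.extend(path)       # flush the single surviving path
            best = {path[-1]: (1.0, [])}   # restart from the fixed tag
    # flush whatever is still pending at sentence end
    output.extend(max(best.values(), key=lambda sp: sp[0])[1])
    return output
```

In a stretch of seven ambiguous words, all competing paths stay alive until the next non-ambiguous word (or the sentence end) forces the flush.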
In SWPoST (sliding-window part-of-speech taggers), decisions are not
delayed. For instance, with 'left context size=1, right context
size=1', once a 3-gram of _ambiguity classes_ is detected, the word in
the middle is assigned a tag. There is no need to wait for non-ambiguous
words to make a decision.
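By contrast with the HMM, the sliding-window decision can be sketched as a plain table lookup over 3-grams of ambiguity classes (the rule table and boundary marker below are hypothetical):

```python
def swpost_tag(ambiguity_classes, rules):
    """Toy sliding-window tagger, left and right context size 1.

    Each word gets a tag from its own ambiguity class by looking only at
    the classes of its immediate neighbours; no decision is ever delayed.
    rules[(left, mid, right)] -> tag chosen for the middle word.
    """
    BOUND = frozenset({'#'})          # sentence-boundary pseudo-class
    padded = [BOUND] + [frozenset(ac) for ac in ambiguity_classes] + [BOUND]
    tags = []
    for left, mid, right in zip(padded, padded[1:], padded[2:]):
        if len(mid) == 1:             # non-ambiguous: nothing to decide
            tags.append(next(iter(mid)))
        else:                         # class 3-gram seen -> tag, immediately
            tags.append(rules[(left, mid, right)])
    return tags
```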
Forbids could be applied before training (so that the wrong tag
sequences are never seen in training) or after (they would work nicely
when the context contains non-ambiguous words). I have to think about it.
I have tried using the 3-gram HMM tagger (i.e. 'left context size=2,
right context size=0'?) made by a GSoC student 3 years ago, but the
results weren't better than the 2-gram for English->Esperanto. So I
stayed with the (not very good) 2-gram tagger.
What you call the 2-gram HMM tagger is more powerful than the 'left
context size=1, right context size=0' sliding-window tagger.
One of the problems of a 3-gram tagger is that the number of
parameters to estimate in the HMM explodes. Processing gets slower and
you need a much bigger training set for unsupervised tagger training.
The number of parameters of the SWPoST is large but, as they may be
encoded as a series of decision rules, the model may be made very
compact. And, by the way, Fran, it would be very similar to a Brill
tagger, and could be integrated.
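A back-of-the-envelope comparison of parameter counts illustrates both the 3-gram explosion and the size of the raw SWPoST table before any compaction into decision rules. The sizes below are invented, and these are loose upper bounds:

```python
# T = number of tags, A = number of ambiguity classes (invented sizes).
T, A = 50, 300

bigram_hmm = T**2 + T * A    # tag->tag transitions + class emissions
trigram_hmm = T**3 + T * A   # tag-pair->tag transitions + class emissions
swpost_1_1 = A**3 * T        # one score per (left, mid, right) class
                             # 3-gram and candidate tag, before pruning
```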
At the end of the article you write: 'We are currently studying ways to
reduce further the number of states and transitions at a small price in
tagging accuracy, by using probabilistic criteria to prune uncommon
contexts which do not contribute significantly to the overall accuracy'
So.... did you find out anything? ;-)
No. The first author worked for me on a project. He made some decisions
about copyright and licensing that I hadn't approved. I told him this
was unacceptable; he got angry and decided he didn't want to work with
me anymore. The result: the work was abandoned, and the code is nowhere
to be found.
> (3) A preprocessor or compiler to avoid having to write structural transfer
[...]
si ( ( ( ($1.orig.gen!="<mf>")
         i ($2.orig.gen!="<mf>")
         i ($1.orig.gen==$2.orig.gen) )
       o ( ($1.orig.gen=="<mf>")
           o ($2.orig.gen=="<mf>") ) )
     i ( ( ($1.orig.nbr!="<sp>")
           i ($2.orig.nbr!="<sp>")
           i ($1.orig.nbr==$2.orig.nbr) )
         o ( ($1.orig.nbr=="<sp>")
             o ($2.orig.nbr=="<sp>") ) ) )
is indeed much easier to understand than our current transfer XML
format.
Easier to write too.
And yes, this would be a fine project I think. One of the challenges
would be writing the validator though.
No validator needed. As I mentioned for dix files, one needs to write a
proper compiler.
As far as I understand there is a 1-to-1 mapping to our current
transfer XML format, so validation could be done by using the
converter and then validating the XML file.
The converter (compiler) would choke on errors. The trip back would be
validation followed by XSLT transformation.
It will probably be hard to convert all facets of the .dix files into
this format, so a round trip would lose information in some cases.
imho, this is basically reinventing lexc -- were there validators
available for it?
As far as I understand there is a mapping to our current transfer XML
format, so validation could be done by using the converter and then
validating the XML file.
Vide supra.
> (6) Tools to order .dixes and point at "bad coding style" (which
> would have to be defined). My impression is that the current .dix
> format is too powerful and allows almost anything. I have to think
> more about this idea, but I couldn't help throwing it out at you.
We have the idea "lint for Apertium", which is quite similar to this
one.
http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/lint_for_Apertium
Yes, this is 'lint' again. Great idea.
Much more than lint, as I told Fran.
Cheers
Mikel
--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff