Gang,
I have made changes to your text. You'll see my comments here:
http://piratepad.net/gang-chen-gsoc-2013
After that, I saw your P.S. below. I'll paste it somewhere and
comment on it. Hold on.
Mikel
------------------------------------------------------------------------
OK, I can't wait to share with you the example that helped me work
through all the maths in the unsupervised part of the paper.
Suppose that we only have two sentences:
A ^B/x/y$ C
A ^D/x/y/z$ C
where A, B, C and D are all words, x, y and z are the possible tags for
the middle word in the context A_C, and we focus only on this context.
At the initial stage,
n_0 (A_x_C) = 1 * 1/2 + 1 * 1/3 = 5/6
n_0 (A_y_C) = 1 * 1/2 + 1 * 1/3 = 5/6
n_0 (A_z_C) = 1 * 1/3 = 1/3
where n_0 (A_x_C) denotes the estimated count of the context A_C tagging
the middle word as 'x' at the 0-th iteration.
Then, using equation (10) in the paper, the iteration begins with the
1-st iteration,
n_1 (A_x_C) = 5/6 * ( 1 * 1/(5/6 + 5/6) + 1 * 1/(5/6 + 5/6 + 1/3) ) = 11/12
n_1 (A_y_C) = 5/6 * ( 1 * 1/(5/6 + 5/6) + 1 * 1/(5/6 + 5/6 + 1/3) ) = 11/12
n_1 (A_z_C) = 1/3 * ( 1 * 1/(5/6 + 5/6 + 1/3) ) = 1/6
and to the 2-nd iteration:
n_2 (A_x_C) = 11/12 * ( 1 * 1/(11/12 + 11/12) + 1 * 1/(11/12 + 11/12 +
1/6) ) = 23/24
n_2 (A_y_C) = 11/12 * ( 1 * 1/(11/12 + 11/12) + 1 * 1/(11/12 + 11/12 +
1/6) ) = 23/24
n_2 (A_z_C) = 1/6 * ( 1 * 1/(11/12 + 11/12 + 1/6) ) = 1/12
...
In this way, the wheels are running!
So in the end the A_C context will output either 'x' or 'y' as the best
tag, which is consistent with intuition.
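
The iteration above can be reproduced with a minimal Python sketch. It
implements only the update as it is applied in this worked example (each
occurrence of the context hands out one unit of count, split in
proportion to the current counts of its candidate tags); the function
names initial_counts and reestimate are illustrative and are not taken
from the paper or from Apertium.

```python
from fractions import Fraction


def initial_counts(occurrences):
    """Spread each occurrence uniformly over its candidate tags (n_0)."""
    counts = {}
    for tags in occurrences:
        share = Fraction(1, len(tags))
        for tag in tags:
            counts[tag] = counts.get(tag, Fraction(0)) + share
    return counts


def reestimate(counts, occurrences):
    """One iteration: split each occurrence's unit of count in
    proportion to the current counts of its candidate tags."""
    new_counts = {tag: Fraction(0) for tag in counts}
    for tags in occurrences:
        total = sum(counts[tag] for tag in tags)
        for tag in tags:
            new_counts[tag] += counts[tag] / total
    return new_counts


# The two occurrences of context A_C: the candidate tags of B and of D.
occurrences = [("x", "y"), ("x", "y", "z")]

counts = initial_counts(occurrences)
for iteration in range(3):
    print(iteration, {tag: str(n) for tag, n in counts.items()})
    counts = reestimate(counts, occurrences)
```

Exact fractions are used so that the printed values can be compared
directly with the hand-computed ones (5/6, 11/12, 23/24 for 'x' and 'y';
1/3, 1/6, 1/12 for 'z').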
I guess you must have been happy when you first invented the algorithm :)
So far, the finite-state transducer *minimization* part is still a
problem for me. I think I need to set aside some time to learn about it.
------------------------------ split line ------------------------------
The following is the reply to your last mail.
> For that, you will have to study the current .tsx format and make
> sense of it, as your tagger will use exactly that format.
As for the .tsx format you mentioned, I have made sense of it by reading
the example in the en-es package.
> Forbid rules can be applied to the input text before actually training
> or running the tagger. You will also need to find a good way to store
> probabilities or turn them into rules which can be read and perhaps
> edited using linguistic knowledge.
To be honest, I didn't quite catch your point here.
For example, suppose we have a FORBID rule that forbids the tag sequence
"a x", and the training text is:
^A/a/b$ ^B/x/y$
My question is:
It seems that we can't just drop 'a' from A and 'x' from B in the input
text, because "a y" and "b x" are not forbidden. If 'a' of A and 'x' of
B are dropped together, we will never see "a y" or "b x". So how can
forbid rules be applied *before* training?
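
To make the problem concrete, here is a minimal Python sketch that
enumerates the tag sequences of this two-word example and keeps those
that contain no forbidden adjacent pair. The function name
allowed_sequences is only illustrative; it is not part of Apertium or of
the .tsx format.

```python
from itertools import product


def allowed_sequences(candidates, forbidden_pairs):
    """List every tag sequence that contains no forbidden adjacent pair."""
    sequences = []
    for seq in product(*candidates):
        if all(pair not in forbidden_pairs for pair in zip(seq, seq[1:])):
            sequences.append(seq)
    return sequences


candidates = [("a", "b"), ("x", "y")]  # ^A/a/b$ ^B/x/y$
forbidden = {("a", "x")}               # FORBID "a x"

print(allowed_sequences(candidates, forbidden))
# → [('a', 'y'), ('b', 'x'), ('b', 'y')]
```

The three surviving sequences still use all four tags, so no single tag
can be deleted from either word without losing an allowed sequence.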
However, I came up with a way that seems able to apply forbid rules
*during tagging*. Let me explain.
For example, suppose we have the forbid rule "a x", and the input
sentence to be tagged is as follows:
^A/a/b$ ^B/x/y$
First, A is tagged as 'a' with the help of the context #_B (# is the
sentence start). So, for B, the candidate 'x' is directly FORBIDDEN.
This is what I mean by "during tagging".
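
The idea can be sketched in Python as follows. This is only an
illustration under assumptions: filter_candidates and tag_left_to_right
are hypothetical names, the choose argument stands in for whatever
context-based decision the real tagger makes, and keeping all candidates
when every one of them is forbidden is my own fallback assumption.

```python
def filter_candidates(previous_tag, candidates, forbidden_pairs):
    """Drop candidates that would form a forbidden pair with previous_tag."""
    kept = [t for t in candidates if (previous_tag, t) not in forbidden_pairs]
    # Assumption: if every candidate is forbidden, keep them all rather
    # than leave the word untagged.
    return kept or list(candidates)


def tag_left_to_right(sentence, forbidden_pairs, choose):
    """Tag word by word, filtering each word's candidates against the
    tag already chosen for the previous word ('#' is the sentence start)."""
    tags = []
    previous = "#"
    for candidates in sentence:
        kept = filter_candidates(previous, candidates, forbidden_pairs)
        previous = choose(kept)  # stand-in for the context-based choice
        tags.append(previous)
    return tags


sentence = [("a", "b"), ("x", "y")]  # ^A/a/b$ ^B/x/y$
forbidden = {("a", "x")}             # FORBID "a x"

# Here choose simply takes the first remaining candidate.
print(tag_left_to_right(sentence, forbidden, choose=lambda kept: kept[0]))
# → ['a', 'y']
```

Once A has been tagged as 'a', the only candidate left for B is 'y'.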
However, there seems to be a problem with this approach: the forbidden
pair "a x" still takes up some probability mass during and after
training. Might this affect the precision of the tagger? What do you
think?
Looking forward to your reply!
Best,
Gang
2013/4/21 Mikel Forcada <[email protected]>
Gang,
great stuff; I haven't checked it exhaustively but as far as I am
testing it seems to behave as expected.
Now it is time to move on to preparing your application. For that,
you will have to study the current .tsx format and make sense of it,
as your tagger will use exactly that format.
Forbid rules can be applied to the input text before actually
training or running the tagger. You will also need to find a good
way to store probabilities or turn them into rules which can be read
and perhaps edited using linguistic knowledge.
Please do not hesitate to ask any questions to me or to the list.
Best,
Mikel
--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff