Re: [Apertium-stuff] Coding Challenge for idea "Sliding-window part-of-speech tagger"

Mikel Forcada Wed, 24 Apr 2013 09:46:07 -0700

Gang,
my second round of responses is at http://piratepad.net/gang-s-notes. I
have to think harder, because I don't know how to solve the problem you
pose.


Mikel

On 04/24/2013 05:57 PM, Gang Chen wrote:
> Hi, Mikel,
>
> Thank you for your guidance!
>
> During the last 2 days, I mainly focused on reading the paper and
> writing my application. The good news is that I now understand the
> unsupervised training algorithm, which I think is the most
> mathematically heavy part, and that the first draft of the application
> is done :)
>
> First of all, would you please have a look at the application draft?
> Your advice is always welcome! Please see here:
> https://github.com/elephantgcc/gsoc-2013/blob/master/Application
>
>
> ----------------------------- split line -----------------------------
>
> OK, I can't wait to share with you the example that helped me get
> through all the maths in the unsupervised part of the paper.
>
> Suppose that we only have two sentences:
> A ^B/x/y$ C
> A ^D/x/y/z$ C
>
> where A, B, C and D are all words, x, y and z are the possible tags of
> the middle word, and we focus only on the context A_C.
>
> At the initial stage,
>
> n_0 (A_x_C) = 1 * 1/2 + 1 * 1/3 = 5/6
> n_0 (A_y_C) = 1 * 1/2 + 1 * 1/3 = 5/6
> n_0 (A_z_C) = 1 * 1/3 = 1/3
>
> where n_0 (A_x_C) denotes the estimated count of the context A_C
> tagging the middle word as 'x', at the 0-th iteration.
>
> Then, using equation (10) in the paper, the iteration begins with the
> 1-st iteration,
>
> n_1 (A_x_C) = 5/6 * ( 1 * 1/(5/6 + 5/6) + 1 * 1/(5/6 + 5/6 + 1/3) )
>             = 11/12
> n_1 (A_y_C) = 5/6 * ( 1 * 1/(5/6 + 5/6) + 1 * 1/(5/6 + 5/6 + 1/3) )
>             = 11/12
> n_1 (A_z_C) = 1/3 * ( 1 * 1/(5/6 + 5/6 + 1/3) ) = 1/6
>
> and then the 2-nd iteration:
>
> n_2 (A_x_C) = 11/12 * ( 1 * 1/(11/12 + 11/12)
>                       + 1 * 1/(11/12 + 11/12 + 1/6) ) = 23/24
> n_2 (A_y_C) = 11/12 * ( 1 * 1/(11/12 + 11/12)
>                       + 1 * 1/(11/12 + 11/12 + 1/6) ) = 23/24
> n_2 (A_z_C) = 1/6 * ( 1 * 1/(11/12 + 11/12 + 1/6) ) = 1/12
>
> ...
>
> In this way, the wheels keep turning!
>
> So in the end the A_C context will output either 'x' or 'y' as the best
> tag, which is consistent with intuition.
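
A minimal sketch of the arithmetic in the example above, using exact
fractions. It only mirrors this two-sentence, single-context example;
whether it matches equation (10) of the paper in full generality is an
assumption, and the names OCCURRENCES, initial_counts and reestimate are
made up for illustration.

```python
from fractions import Fraction

# Tags offered for the middle word in each example sentence
# (^B/x/y$ and ^D/x/y/z$); the context A_C is the same in both.
OCCURRENCES = [("x", "y"), ("x", "y", "z")]


def initial_counts(occurrences):
    """n_0: each occurrence spreads one count evenly over its tags."""
    counts = {}
    for tags in occurrences:
        for tag in tags:
            counts[tag] = counts.get(tag, Fraction(0)) + Fraction(1, len(tags))
    return counts


def reestimate(counts, occurrences):
    """One iteration: each occurrence shares one count among its tags
    in proportion to the current counts of those tags."""
    new_counts = {tag: Fraction(0) for tag in counts}
    for tags in occurrences:
        total = sum(counts[tag] for tag in tags)
        for tag in tags:
            new_counts[tag] += counts[tag] / total
    return new_counts


n0 = initial_counts(OCCURRENCES)
n1 = reestimate(n0, OCCURRENCES)
n2 = reestimate(n1, OCCURRENCES)
for name, n in (("n_0", n0), ("n_1", n1), ("n_2", n2)):
    print(name, {tag: str(value) for tag, value in n.items()})
```

In this sketch the counts for x, y and z come out as 5/6, 5/6, 1/3, then
11/12, 11/12, 1/6, then 23/24, 23/24, 1/12. The total stays at 2 (one
count per sentence) while the share of 'z' halves at each step.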
>
> I guess you must have been happy when you first invented the algorithm :)
>
> So far, the finite-state transducer *minimization* part still remains
> a problem for me. I think I need to set aside some time to learn
> about it.
>
> ----------------------------- split line -----------------------------
>
> The following is my reply to your last mail.
>
> > For that, you will have to study the current .tsx format and make
> > sense of it, as your tagger will use exactly that format.
>
> As for the .tsx format you mentioned, I have made sense of it by
> reading the example in the en-es package.
>
> > Forbid rules can be applied to the input text before actually
> > training or running the tagger. You will also need to find a good way
> > to store probabilities or turn them into rules which can be read and
> > perhaps edited using linguistic knowledge.
>
> To be honest, I didn't quite catch your point here.
>
> For example, suppose we have a FORBID rule that forbids the tag
> sequence "a x", and the training text is:
>
> ^A/a/b$ ^B/x/y$
>
> My question is:
>
> It seems that we can't just drop 'a' from A and 'x' from B in the input
> text, because "a y" and "b x" are not forbidden. If 'a' of A and 'x'
> of B are dropped together, we will never get "a y" or "b x".
> So how can forbid rules be applied *before* training?
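
The difficulty raised above can be checked by brute force. The sketch
below only illustrates the problem and does not propose a solution; the
names FORBIDDEN, WORDS and nonempty_subsets are hypothetical. It
enumerates every way of deleting tags word by word and tests whether any
of them leaves exactly the tag sequences that the forbid rule allows.

```python
from itertools import product

FORBIDDEN = {("a", "x")}          # the forbid rule from the example
WORDS = [("a", "b"), ("x", "y")]  # ^A/a/b$ ^B/x/y$

# Tag sequences that survive the forbid rule.
allowed = {seq for seq in product(*WORDS) if seq not in FORBIDDEN}


def nonempty_subsets(tags):
    """All non-empty subsets of one word's tag list."""
    return [
        tuple(tag for i, tag in enumerate(tags) if (mask >> i) & 1)
        for mask in range(1, 1 << len(tags))
    ]


# Try every per-word pruning and keep those whose remaining tag
# sequences are exactly the allowed ones.
exact_prunings = [
    pruned
    for pruned in product(*(nonempty_subsets(tags) for tags in WORDS))
    if set(product(*pruned)) == allowed
]

print(sorted(allowed))   # [('a', 'y'), ('b', 'x'), ('b', 'y')]
print(exact_prunings)    # []
```

Three sequences remain allowed, but per-word pruning can only leave 1, 2
or 4 sequences here, so no pruning reproduces the allowed set exactly.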
>
> However, I came up with a way that seems able to apply forbid rules
> *during tagging*. Let me explain.
>
> For example, suppose we have the forbid rule "a x", and the input
> sentence to be tagged is as follows:
> ^A/a/b$ ^B/x/y$
>
> First, A is tagged as 'a' with the help of the context #_B (# is
> the sentence start). Then, for B, the candidate 'x' is directly
> FORBIDDEN. This is what I mean by "during tagging".
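
A minimal sketch of the "during tagging" idea described above, assuming
a left-to-right pass in which the tag already chosen for the previous
word is known. The function name filter_candidates and the fallback
behaviour are assumptions, not part of any existing tagger.

```python
FORBIDDEN = {("a", "x")}  # the forbid rule from the example


def filter_candidates(previous_tag, candidates, forbidden=FORBIDDEN):
    """Drop candidates that would form a forbidden pair with previous_tag.

    If every candidate would be removed, the unfiltered list is returned
    (an assumption; the mail does not say what should happen then).
    """
    kept = [tag for tag in candidates if (previous_tag, tag) not in forbidden]
    return kept or list(candidates)


# A was tagged 'a', so 'x' is no longer a candidate for B.
print(filter_candidates("a", ["x", "y"]))  # ['y']
# Had A been tagged 'b', both candidates for B would remain.
print(filter_candidates("b", ["x", "y"]))  # ['x', 'y']
```

The filter only restricts the candidates at tagging time; it does not
change the counts estimated during training.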
>
> However, there seems to be a problem with this approach: the
> forbidden pair "a x" still takes up some probability mass during and
> after the training procedure. Might this affect the precision of the
> tagger? What do you think?
>
>
> I look forward to your reply!
>
>
>
> Best,
>
> Gang
>
>
>
>
>
> 2013/4/21 Mikel Forcada <[email protected]>
>
>     Gang,
>
>     great stuff; I haven't checked it exhaustively, but as far as I
>     have tested, it seems to behave as expected.
>
>     Now it is time to move on to preparing your application. For that,
>     you will have to study the current .tsx format and make sense of
>     it, as your tagger will use exactly that format.
>
>     Forbid rules can be applied to the input text before actually
>     training or running the tagger. You will also need to find a good
>     way to store probabilities or turn them into rules which can be
>     read and perhaps edited using linguistic knowledge.
>
>     Please do not hesitate to ask any questions to me or to the list.
>
>     Best,
>
>     Mikel
>
>
>     -- 
>     Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
>     Departament de Llenguatges i Sistemes Informàtics
>     Universitat d'Alacant
>     E-03071 Alacant, Spain
>     Phone: +34 96 590 9776
>     Fax: +34 96 590 9326
>
>


-- 
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
