Re: [Apertium-stuff] Coding Challenge for idea "Sliding-window part-of-speech tagger"

Mikel Forcada Wed, 24 Apr 2013 10:04:24 -0700

Sleep well, Gang.
I found this paper of ours
http://altea.dlsi.ua.es/~mlf/docum/sanchezvillamil05p.pdf where we use a
different model. As it works on tag space and not on ambiguity class
space, perhaps it can be used to implement forbids. I have to check...


Mikel

On 04/24/2013 07:01 PM, Gang Chen wrote:
> Hi, Mikel,
>
> Thanks for your prompt and informative reply!
>
> I will think carefully about the questions and make updates to the
> application.
>
> Because of time difference, I have to get some rest.
>
>
> Best,
>
> Gang
>
>
>
> 2013/4/25 Mikel Forcada <[email protected]>
>
>     Gang,
>     my second round of responses is http://piratepad.net/gang-s-notes
>     I have to think harder because I don't know how to solve the
>     problem you pose.
>
>     Mikel
>
>     On 04/24/2013 05:57 PM, Gang Chen wrote:
>>     Hi, Mikel,
>>
>>     Thank you for your guidance!
>>
>>     During the last 2 days, I was mainly focused on reading the paper
>>     and writing my application. The good news is that I understand
>>     the unsupervised training algorithm, which I think is indeed the
>>     most mathematically heavy part, and that the first draft of
>>     the application is done :)
>>
>>     First of all, would you please have a look at the application
>>     draft? Your advice is always welcome! Please see here:
>>     https://github.com/elephantgcc/gsoc-2013/blob/master/Application
>>
>>
>>     ----------------------------- split line -----------------------------
>>
>>     OK, I can't wait to share with you the example that helped me
>>     get through all the maths in the unsupervised part of the paper.
>>
>>     Suppose that we only have two sentences:
>>     A ^B/x/y$ C
>>     A ^D/x/y/z$ C
>>
>>     where A, B, C and D are all words, x, y, and z are possible tags
>>     for the middle word in the context A_C, and we only focus on this
>>     context.
>>
>>     At the initial stage,
>>
>>     n_0 (A_x_C) = 1 * 1/2 + 1 * 1/3 = 5/6
>>     n_0 (A_y_C) = 1 * 1/2 + 1 * 1/3 = 5/6
>>     n_0 (A_z_C) = 1 * 1/3 = 1/3
>>
>>     where n_0 (A_x_C) denotes the estimated number of times that the
>>     context A_C tags the middle word as 'x', at the 0-th iteration.
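
The initial counts above can be reproduced with a few lines of Python. This is only a minimal sketch of the toy example, not code from the paper or from Apertium: the names `occurrences` and `initial_counts` are made up for illustration, and each occurrence of the context A_C is assumed to be represented just by the tag set of its middle word.

```python
from collections import defaultdict
from fractions import Fraction

# Toy corpus (assumption): one entry per occurrence of the context A_C,
# holding the possible tags of the middle word (B has x/y, D has x/y/z).
occurrences = [("x", "y"), ("x", "y", "z")]


def initial_counts(occurrences):
    """Spread each occurrence uniformly over the tags of its middle word."""
    counts = defaultdict(Fraction)  # Fraction() is an exact zero
    for tags in occurrences:
        for tag in tags:
            counts[tag] += Fraction(1, len(tags))
    return dict(counts)


n0 = initial_counts(occurrences)
for tag, value in sorted(n0.items()):
    print(f"n_0(A_{tag}_C) = {value}")
```

Exact fractions are used instead of floats so that the values can be compared directly with 5/6 and 1/3 in the hand computation.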
>>
>>     then using equation (10) in the paper, the iteration begins with
>>     the 1-st iteration,
>>
>>     n_1 (A_x_C) = 5/6 * ( 1 * 1/(5/6 + 5/6) + 1 * 1/(5/6 + 5/6 + 1/3)
>>     ) = 11/12
>>     n_1 (A_y_C) = 5/6 * ( 1 * 1/(5/6 + 5/6) + 1 * 1/(5/6 + 5/6 + 1/3)
>>     ) = 11/12
>>     n_1 (A_z_C) = 1/3 * ( 1 * 1/(5/6 + 5/6 + 1/3) ) = 1/6
>>
>>     and to the 2-nd iteration:
>>
>>     n_2 (A_x_C) = 11/12 * ( 1 * 1/(11/12 + 11/12) + 1 * 1/(11/12 +
>>     11/12 + 1/6) ) = 23/24
>>     n_2 (A_y_C) = 11/12 * ( 1 * 1/(11/12 + 11/12) + 1 * 1/(11/12 +
>>     11/12 + 1/6) ) = 23/24
>>     n_2 (A_z_C) = 1/6 * ( 1 * 1/(11/12 + 11/12 + 1/6) ) = 1/12
>>
>>     ...
>>
>>     In this way, the wheels are running!
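
The two iterations above can be checked mechanically. The sketch below assumes the update has the simplified form used in this toy example (each occurrence shares one unit of count among its tags in proportion to the current counts); it is not a full implementation of equation (10), and `update` and `occurrences` are illustrative names only.

```python
from fractions import Fraction

# Same toy corpus as in the example: possible tags of the middle word
# for each occurrence of the context A_C (an assumed representation).
occurrences = [("x", "y"), ("x", "y", "z")]

# Initial counts n_0 taken from the hand computation above.
n0 = {"x": Fraction(5, 6), "y": Fraction(5, 6), "z": Fraction(1, 3)}


def update(counts, occurrences):
    """One re-estimation step: each occurrence contributes one unit of
    count, split among its tags in proportion to the current counts."""
    new_counts = {tag: Fraction(0) for tag in counts}
    for tags in occurrences:
        total = sum(counts[tag] for tag in tags)
        for tag in tags:
            new_counts[tag] += counts[tag] / total
    return new_counts


n1 = update(n0, occurrences)
n2 = update(n1, occurrences)
print(n1)
print(n2)
```

With these inputs the counts for 'x' and 'y' grow towards 1 while the count for 'z' is halved at each step, and the total stays at 2, the number of occurrences.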
>>
>>     So finally the A_C context will output either 'x' or 'y' as the
>>     best tag, which is consistent with intuition.
>>
>>     I guess you must have been happy when you first invented the algorithm :)
>>
>>     So far, the finite-state transducer *minimization* part still
>>     remains a problem to me. I think I still need to spare some time
>>     to learn about it.
>>
>>     ----------------------------- split line -----------------------------
>>
>>     The following is the reply to your last mail.
>>
>>     > For that, you will have to study the current .tsx format and
>>     > make sense of it, as your tagger will use exactly that format.
>>
>>     As for the .tsx format you mentioned, I've made sense of it by
>>     reading the example in the en-es package.
>>
>>     > Forbid rules can be applied to the input text before actually
>>     > training or running the tagger. You will also need to find a good
>>     > way to store probabilities or turn them into rules which can be
>>     > read and perhaps edited using linguistic knowledge.
>>
>>     To be honest, I didn't quite catch your point here.
>>
>>     For example, we have a FORBID rule that forbids the tag sequence
>>     "a x", and we have training text as:
>>
>>     ^A/a/b$ ^B/x/y$
>>
>>     My question is:
>>
>>     It seems that we can't just drop 'a' from A and 'x' from B in the
>>     input text, because "a y" and "b x" are not forbidden. If 'a' of
>>     A and 'x' of B are dropped together, we will never have "a y"
>>     or "b x". So how can forbid rules be applied *before* training?
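
The difficulty described above can be made concrete by enumerating the tag pairs. This is a rough sketch under simplifying assumptions: `parse_unit` only handles the simplified `^word/tag/tag$` notation used in this thread, not the full Apertium stream format, and both function names are hypothetical.

```python
from itertools import product


def parse_unit(unit):
    """Split a simplified lexical unit such as '^A/a/b$' into (word, tags)."""
    word, *tags = unit.strip("^$").split("/")
    return word, tags


def allowed_pairs(left, right, forbidden):
    """List every (left tag, right tag) pair not excluded by a forbid rule."""
    _, left_tags = parse_unit(left)
    _, right_tags = parse_unit(right)
    return [pair for pair in product(left_tags, right_tags)
            if pair not in forbidden]


pairs = allowed_pairs("^A/a/b$", "^B/x/y$", forbidden={("a", "x")})
left_survivors = {left for left, _ in pairs}
right_survivors = {right for _, right in pairs}
print(pairs)
print(left_survivors, right_survivors)
```

Three of the four pairs survive the rule, and every individual tag (a, b, x, y) still appears in at least one of them, so no single tag can be deleted from either word without also losing an allowed pair.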
>>
>>     However, I came up with a way that seems able to apply forbid
>>     rules *during tagging*. Let me explain.
>>
>>     For example, suppose we have the forbid rule "a x", and the input
>>     sentence to be tagged is as follows:
>>     ^A/a/b$ ^B/x/y$
>>
>>     First, suppose A is tagged as 'a' with the help of the context #_B
>>     (# is the sentence start). Then, for B, the candidate 'x' is
>>     directly FORBIDDEN. This is what I mean by "during tagging".
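
The "during tagging" idea could look roughly like the filter below. It is a sketch of the proposal as described here, not of any existing Apertium code; `filter_candidates` is a hypothetical name, and the fallback of keeping all candidates when every one of them is forbidden is an added assumption, since the thread does not say what should happen in that case.

```python
def filter_candidates(previous_tag, candidates, forbidden):
    """Drop candidate tags that form a forbidden pair with the tag
    already chosen for the previous word."""
    kept = [tag for tag in candidates if (previous_tag, tag) not in forbidden]
    # Assumption: if the rules would remove every candidate, keep them
    # all rather than leave the word without any tag.
    return kept or list(candidates)


forbidden = {("a", "x")}
after_a = filter_candidates("a", ["x", "y"], forbidden)
after_b = filter_candidates("b", ["x", "y"], forbidden)
print(after_a)
print(after_b)
```

If A has been tagged 'a', only 'y' is left for B; if A has been tagged 'b', both candidates remain and the tagger chooses between them as usual.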
>>
>>     However, there seems to be a problem with this approach: the
>>     forbidden pair "a x" still takes up some probability mass during
>>     and after the training procedure. Might this affect the precision
>>     of the tagger? What do you think?
>>
>>
>>     Look forward to your reply!
>>
>>
>>
>>     Best,
>>
>>     Gang
>>
>>
>>
>>
>>
>>     2013/4/21 Mikel Forcada <[email protected]>
>>
>>         Gang,
>>
>>         great stuff; I haven't checked it exhaustively but as far as
>>         I am testing it seems to behave as expected.
>>
>>         Now it is time to move on to preparing your application. For
>>         that, you will have to study the current .tsx format and make
>>         sense of it, as your tagger will use exactly that format.
>>
>>         Forbid rules can be applied to the input text before actually
>>         training or running the tagger. You will also need to find a
>>         good way to store probabilities or turn them into rules which
>>         can be read and perhaps edited using linguistic knowledge.
>>
>>         Please do not hesitate to ask any questions to me or to the list.
>>
>>         Best,
>>
>>         Mikel
>>
>>
>>         -- 
>>         Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
>>         Departament de Llenguatges i Sistemes Informàtics
>>         Universitat d'Alacant
>>         E-03071 Alacant, Spain
>>         Phone: +34 96 590 9776
>>         Fax: +34 96 590 9326
>>
>>
>
>
>     -- 
>     Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
>     Departament de Llenguatges i Sistemes Informàtics
>     Universitat d'Alacant
>     E-03071 Alacant, Spain
>     Phone: +34 96 590 9776
>     Fax: +34 96 590 9326
>
>


-- 
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326

