Sleep well, Gang.
I found this paper of ours
http://altea.dlsi.ua.es/~mlf/docum/sanchezvillamil05p.pdf where we use a
different model. As it works on tag space and not on ambiguity class
space, perhaps it can be used to implement forbids. I have to check...
Mikel
On 04/24/2013 07:01 PM, Gang Chen wrote:
> Hi, Mikel,
>
> Thanks for your prompt and informative reply!
>
> I will think carefully about the questions and make updates to the
> application.
>
> Because of time difference, I have to get some rest.
>
>
> Best,
>
> Gang
>
>
>
> 2013/4/25 Mikel Forcada <[email protected]>
>
> Gang,
> my second round of responses is http://piratepad.net/gang-s-notes
> I have to think harder because I don't know how to solve the
> problem you pose.
>
> Mikel
>
> On 04/24/2013 05:57 PM, Gang Chen wrote:
>> Hi, Mikel,
>>
>> Thank you for your guidance!
>>
>> During the last 2 days, I was mainly focused on reading the paper
>> and writing my application. The good news is that I understand
>> the unsupervised training algorithm, which I think is indeed the
>> most mathematically heavy part, and that the first draft of the
>> application is done :)
>>
>> First of all, would you please have a look at the application
>> draft? Your advice is always welcome! Please see here:
>> https://github.com/elephantgcc/gsoc-2013/blob/master/Application
>>
>>
>> ---------------------------- split line ----------------------------
>>
>> OK, I can't wait to share with you the example that helped me to
>> go through all the maths in the unsupervised part of the paper.
>>
>> Suppose that we only have two sentences:
>> A ^B/x/y$ C
>> A ^D/x/y/z$ C
>>
>> where A, B, C and D are all words, x, y and z are the possible tags
>> for the middle word in the context A_C, and we only focus on this
>> context.
>>
>> At the initial stage,
>>
>> n_0 (A_x_C) = 1 * 1/2 + 1 * 1/3 = 5/6
>> n_0 (A_y_C) = 1 * 1/2 + 1 * 1/3 = 5/6
>> n_0 (A_z_C) = 1 * 1/3 = 1/3
>>
>> where n_0 (A_x_C) denotes the estimated count of the middle word
>> being tagged 'x' in the context A_C, at the 0-th iteration.
>>
>> Then, using equation (10) in the paper, the 1-st iteration gives
>>
>> n_1 (A_x_C) = 5/6 * ( 1 * 1/(5/6 + 5/6) + 1 * 1/(5/6 + 5/6 + 1/3)
>> ) = 11/12
>> n_1 (A_y_C) = 5/6 * ( 1 * 1/(5/6 + 5/6) + 1 * 1/(5/6 + 5/6 + 1/3)
>> ) = 11/12
>> n_1 (A_z_C) = 1/3 * ( 1 * 1/(5/6 + 5/6 + 1/3) ) = 1/6
>>
>> and the 2-nd iteration gives
>>
>> n_2 (A_x_C) = 11/12 * ( 1 * 1/(11/12 + 11/12) + 1 * 1/(11/12 +
>> 11/12 + 1/6) ) = 23/24
>> n_2 (A_y_C) = 11/12 * ( 1 * 1/(11/12 + 11/12) + 1 * 1/(11/12 +
>> 11/12 + 1/6) ) = 23/24
>> n_2 (A_z_C) = 1/6 * ( 1 * 1/(11/12 + 11/12 + 1/6) ) = 1/12
>>
>> ...
>>
>> In this way, the wheels are running!
>>
>> So finally the A_C context will output either 'x' or 'y' as the
>> best tag, which is consistent with intuition.
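The hand calculation above can be reproduced with exact fractions. This is a minimal Python sketch, not the tagger's actual code: the names `occurrences`, `initial_counts` and `iterate` are made up for illustration, and the update rule is simply the one used in the worked example (each tag's count is multiplied by the sum, over the occurrences whose ambiguity class contains it, of one over the total count of that class).

```python
from fractions import Fraction

# Ambiguity classes of the middle word in the two example sentences
# (A ^B/x/y$ C and A ^D/x/y/z$ C). Hypothetical representation.
occurrences = [("x", "y"), ("x", "y", "z")]


def initial_counts(occurrences):
    """0-th iteration: share each occurrence equally among its tags."""
    counts = {}
    for tags in occurrences:
        share = Fraction(1, len(tags))
        for tag in tags:
            counts[tag] = counts.get(tag, Fraction(0)) + share
    return counts


def iterate(counts, occurrences):
    """One re-estimation step as applied in the worked example."""
    new_counts = {tag: Fraction(0) for tag in counts}
    for tags in occurrences:
        total = sum(counts[tag] for tag in tags)
        for tag in tags:
            # Each occurrence is shared in proportion to the current counts.
            new_counts[tag] += counts[tag] / total
    return new_counts


counts = initial_counts(occurrences)
for k in range(3):
    print(k, {tag: str(value) for tag, value in sorted(counts.items())})
    counts = iterate(counts, occurrences)
```

In this two-sentence example the three counts always add up to 2, so the count for 'z' halves at every step while the counts for 'x' and 'y' approach 1.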
>>
>> I guess you must have been happy when you first invented the algorithm :)
>>
>> So far, the finite-state transducer *minimization* part still
>> remains a problem for me. I think I still need to set aside some
>> time to learn about it.
>>
>> ---------------------------- split line ----------------------------
>>
>> The following is the reply to your last mail.
>>
>> > For that, you will have to study the current .tsx format and
>> > make sense of it, as your tagger will use exactly that format.
>>
>> As for the .tsx format you mentioned, I have made sense of it by
>> reading the example in the en-es package.
>>
>> > Forbid rules can be applied to the input text before actually
>> > training or running the tagger. You will also need to find a good
>> > way to store probabilities or turn them into rules which can be
>> > read and perhaps edited using linguistic knowledge.
>>
>> To be honest, I didn't quite catch your point here.
>>
>> For example, suppose we have a FORBID rule that forbids the tag
>> sequence "a x", and the training text is:
>>
>> ^A/a/b$ ^B/x/y$
>>
>> My question is:
>>
>> It seems that we can't just drop 'a' from A and 'x' from B in the
>> input text, because "a y" and "b x" are not forbidden. If 'a' of
>> A and 'x' of B are dropped together, we will never see "a y" or
>> "b x". So how can forbid rules be applied *before* training?
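The point above can be checked by enumerating the tag pairs. This is only an illustrative Python sketch; the variable names are made up, and "naive pruning" means deleting 'a' from A and 'x' from B in the input text, as described in the question.

```python
from itertools import product

forbidden = {("a", "x")}
first_tags = ["a", "b"]    # candidate tags of A in ^A/a/b$
second_tags = ["x", "y"]   # candidate tags of B in ^B/x/y$

# All tag sequences the text allows once the forbidden pair is excluded.
allowed_pairs = set(product(first_tags, second_tags)) - forbidden

# Naive pruning: drop 'a' from A and 'x' from B before training.
pruned_pairs = set(product(["b"], ["y"]))

# Sequences that are not forbidden but are lost by the naive pruning.
lost_pairs = allowed_pairs - pruned_pairs
print(sorted(lost_pairs))
```

The pairs that end up in `lost_pairs` are exactly "a y" and "b x", the two sequences the question worries about.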
>>
>> However, I came up with a way that seems able to apply forbid
>> rules *during tagging*. Let me explain.
>>
>> For example, suppose we have the forbid rule "a x", and the input
>> sentence to be tagged is as follows:
>> ^A/a/b$ ^B/x/y$
>>
>> First, suppose A has been tagged as 'a' with the help of the context
>> #_B (# is the sentence start). Then, for B, the candidate 'x' is
>> directly FORBIDDEN. This is what I mean by "during tagging".
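The "during tagging" idea can be sketched as a filter on the candidate tags of the next word. This is a hypothetical Python illustration, not Apertium code; `allowed_candidates` and the set-of-pairs representation of forbid rules are assumptions made for the example.

```python
def allowed_candidates(previous_tag, candidates, forbidden):
    """Keep the candidates that do not form a forbidden pair with previous_tag."""
    return [tag for tag in candidates if (previous_tag, tag) not in forbidden]


forbidden = {("a", "x")}

# A was tagged 'a', so 'x' is ruled out for B.
print(allowed_candidates("a", ["x", "y"], forbidden))

# Had A been tagged 'b', both candidates of B would remain.
print(allowed_candidates("b", ["x", "y"], forbidden))
```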
>>
>> However, there seems to be a problem with this approach: the
>> forbidden pair "a x" still occupies some probability mass during
>> and after the training procedure. Might this affect the precision
>> of the tagger? What do you think?
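One possible way to picture the concern, offered only as an assumption and not as something proposed in this thread, is to drop the forbidden candidate at tagging time and renormalise what is left. In the Python sketch below, the function `renormalise` and the scores 3/5 and 2/5 are invented for illustration; they are not values from the paper or from any trained model.

```python
from fractions import Fraction


def renormalise(scores, previous_tag, forbidden):
    """Drop candidates forbidden after previous_tag and rescale the rest to sum to 1."""
    kept = {tag: score for tag, score in scores.items()
            if (previous_tag, tag) not in forbidden}
    total = sum(kept.values())
    if total == 0:
        return {}  # every candidate was forbidden; the caller must decide what to do
    return {tag: score / total for tag, score in kept.items()}


forbidden = {("a", "x")}
scores_for_b = {"x": Fraction(3, 5), "y": Fraction(2, 5)}  # invented scores

print(renormalise(scores_for_b, "a", forbidden))
print(renormalise(scores_for_b, "b", forbidden))
```

This only moves the probability mass at tagging time; whether the mass that "a x" absorbed during training harms precision remains the open question raised above.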
>>
>>
>> I look forward to your reply!
>>
>>
>>
>> Best,
>>
>> Gang
>>
>>
>>
>>
>>
>> 2013/4/21 Mikel Forcada <[email protected]>
>>
>> Gang,
>>
>> great stuff; I haven't checked it exhaustively but as far as
>> I am testing it seems to behave as expected.
>>
>> Now it is time to move on to preparing your application. For
>> that, you will have to study the current .tsx format and make
>> sense of it, as your tagger will use exactly that format.
>>
>> Forbid rules can be applied to the input text before actually
>> training or running the tagger. You will also need to find a
>> good way to store probabilities or turn them into rules which
>> can be read and perhaps edited using linguistic knowledge.
>>
>> Please do not hesitate to ask any questions to me or to the list.
>>
>> Best,
>>
>> Mikel
>>
>>
>> --
>> Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
>> Departament de Llenguatges i Sistemes Informàtics
>> Universitat d'Alacant
>> E-03071 Alacant, Spain
>> Phone: +34 96 590 9776
>> Fax: +34 96 590 9326
>>
>>
>
>
--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff