Hi Sevilay, hi spectei,

For sentence splitting, I think we need to know neither the syntax nor the sentence boundaries of the language. I also see no need to apply it at runtime, since at runtime we only compute the score of each pattern, so no splitting is required there. I also had one thought about using beam search here, as I see it has no effect, though maybe I am wrong. We can discuss it after we close this thread.
We will handle the whole text as one unit and depend only on the captured patterns, knowing that, in chunker terms, successive patterns that don't share a transfer rule are independent. So, using the lexical form of the text, we match the words with patterns, then match the patterns with rules. Hence we know which patterns are ambiguous and how many ambiguous rules each one matches.

For example, suppose we have a text with the following patterns and their corresponding rule counts:

p1:2 p2:1 p3:6 p4:4 p5:3 p6:5 p7:1 p8:4 p9:4 p10:6 p11:8 p12:5 p13:5 p14:1 p15:3 p16:2

If such a text were handled by our old method, which generates all possible combinations (the product of the rule counts), we would have 82,944,000 combinations, which are not practical at all to score and take heavy computation and memory. If it is handled by our new method, which applies all the ambiguous rules of one pattern while fixing the other patterns to their LRLM rule (the sum of the rule counts), we would have just 60 combinations, and not all of them distinct. That is a drastically low number, which may not be very representative.

But if we apply the splitting idea, we get something in the middle that will hopefully avoid the disadvantages of both methods and keep the advantages of both. We proceed from the start of the text to its end, while maintaining some threshold of, say, 24,000 combinations:

p1 => 2
p1 p2 => 2
p1 p2 p3 => 12
p1 p2 p3 p4 => 48
p1 p2 p3 p4 p5 => 144
p1 p2 p3 p4 p5 p6 => 720
p1 p2 p3 p4 p5 p6 p7 => 720
p1 p2 p3 p4 p5 p6 p7 p8 => 2880
p1 p2 p3 p4 p5 p6 p7 p8 p9 => 11520

And then we stop here, because taking the next pattern (p10, with 6 rules) would exceed the threshold. Having our first split, we can now continue our work on it as usual, but with more combinations, without being overwhelming, which would capture more semantics. After that, we take the next split, and so on.
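To make the idea concrete, here is a minimal Python sketch of the threshold-based splitting described above. The function name and structure are illustrative, not part of any existing Apertium code: it just takes the per-pattern ambiguous-rule counts and greedily closes a split whenever adding the next pattern would push the running product of counts over the threshold.

```python
from math import prod

def greedy_splits(rule_counts, threshold=24000):
    """Split a sequence of per-pattern rule counts into segments whose
    combination count (product of the counts) stays within the threshold."""
    splits, current, product = [], [], 1
    for n in rule_counts:
        # Close the current split if adding this pattern would exceed
        # the threshold (unless the split is empty, to avoid an infinite
        # loop when a single pattern alone exceeds it).
        if current and product * n > threshold:
            splits.append(current)
            current, product = [], 1
        current.append(n)
        product *= n
    if current:
        splits.append(current)
    return splits

# The example from the text: p1..p16 with their rule counts.
counts = [2, 1, 6, 4, 3, 5, 1, 4, 4, 6, 8, 5, 5, 1, 3, 2]
segments = greedy_splits(counts)
for seg in segments:
    print(seg, "=>", prod(seg))
```

Running this on the example yields two splits: p1–p9 with 11,520 combinations and p10–p16 with 7,200, both comfortably under the 24,000 threshold, instead of one block of 82,944,000.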
-----------

I agree with you that testing the current method with more than one pair to measure its accuracy is the priority, and we are currently working on it.

-----------

As for an alternative to yasmet, I agree with spectei. Unfortunately, for now I don't have a solid idea to discuss, but in the next few days I will try to come up with one or more ideas.

On Fri, Apr 5, 2019 at 11:23 PM Francis Tyers <fty...@prompsit.com> wrote:

> El 2019-04-05 20:57, Sevilay Bayatlı escribió:
> > On Fri, 5 Apr 2019, 22:41 Francis Tyers, <fty...@prompsit.com> wrote:
> >
> >> El 2019-04-05 19:07, Sevilay Bayatlı escribió:
> >>> Hi Aboelhamd,
> >>>
> >>> There is some points in your proposal:
> >>>
> >>> First, I do not think "splitting sentence" is a good idea, each
> >>> language has different syntax, how could you know when you should
> >>> split the sentence.
> >>
> >> Apertium works on the concept of a stream of words, so in the
> >> runtime we can't really rely on robust sentence segmentation.
> >>
> >> We can often use it, e.g. for training, but if sentence boundary
> >> detection were to be included, it would need to be trained, as
> >> Sevilay hints at.
> >>
> >> Also, I'm not sure how much we would gain from that.
> >>
> >>> Second, "substitute yasmet with other method", I think the result
> >>> will not be more better if you substituted it with statistical method.
> >>
> >> Substituting yasmet with a more up to date machine-learning method
> >> might be a worthwhile thing to do. What suggestions do you have?
> >>
> >> I think first we have to trying the exact method with more than 3
> >> language pairs and then decide to substitute it or not, because
> >> what is the point of new method if dont achieve gain, then we can
> >> compare the results of two methods and choose the best one. What do
> >> you think?
> >
> > Yes, testing it with more language pairs is also a priority.
>
> Fran
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff