Hi Sevilay, hi spectei,

For sentence splitting, I think we need to know neither the syntax nor the sentence boundaries of the language. I also see no need to apply it at runtime, since at runtime we only compute the score of each pattern, so no splitting is required there. I also had one thought about using beam search here, as I see it has no effect, though maybe I am wrong. We can discuss it after we close this thread.
We will handle the whole text as one unit and depend only on the captured patterns, knowing that, in chunker terms, successive patterns that don't share a transfer rule are independent. So, using the lexical form of the text, we match the words with patterns, then match the patterns with rules. Hence we know which patterns are ambiguous and how many ambiguous rules each one matches.

For example, suppose we have a text with the following patterns and their corresponding rule counts:

p1:2 p2:1 p3:6 p4:4 p5:3 p6:5 p7:1 p8:4 p9:4 p10:6 p11:8 p12:5 p13:5 p14:1 p15:3 p16:2

If such a text were handled by our old method, which generates all possible combinations (the product of the rule counts), we would have 82,944,000 combinations, which are not practical at all to score and take heavy computation and memory. If it is handled by our new method, which applies all the ambiguous rules of one pattern while fixing the other patterns to their LRLM rule (the sum of the rule counts), we would have just 60 combinations, and not all of them distinct. That is a drastically low number, which may not be very representative.

But if we apply the splitting idea, we get something in the middle that will hopefully avoid the disadvantages of both methods and keep the advantages of both. We proceed from the start of the text to its end, while maintaining some threshold of, say, 24,000 combinations:

p1 => 2
p1 p2 => 2
p1 p2 p3 => 12
p1 p2 p3 p4 => 48
p1 p2 p3 p4 p5 => 144
p1 p2 p3 p4 p5 p6 => 720
p1 p2 p3 p4 p5 p6 p7 => 720
p1 p2 p3 p4 p5 p6 p7 p8 => 2880
p1 p2 p3 p4 p5 p6 p7 p8 p9 => 11520

And then we stop here, because taking the next pattern (p10, with 6 rules) would exceed the threshold. Having our first split, we can now continue our work on it as usual, but with more combinations, without being overwhelming, which would capture more semantics. After that, we take the next split, and so on.
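To make the idea concrete, here is a minimal Python sketch of the threshold-based splitting described above. The function name and structure are illustrative, not part of any existing Apertium code: it just takes the per-pattern ambiguous-rule counts and greedily closes a split whenever adding the next pattern would push the running product of counts over the threshold.

```python
from math import prod

def greedy_splits(rule_counts, threshold=24000):
    """Split a sequence of per-pattern rule counts into segments whose
    combination count (product of the counts) stays within the threshold."""
    splits, current, product = [], [], 1
    for n in rule_counts:
        # Close the current split if adding this pattern would exceed
        # the threshold (unless the split is empty, to avoid an infinite
        # loop when a single pattern alone exceeds it).
        if current and product * n > threshold:
            splits.append(current)
            current, product = [], 1
        current.append(n)
        product *= n
    if current:
        splits.append(current)
    return splits

# The example from the text: p1..p16 with their rule counts.
counts = [2, 1, 6, 4, 3, 5, 1, 4, 4, 6, 8, 5, 5, 1, 3, 2]
segments = greedy_splits(counts)
for seg in segments:
    print(seg, "=>", prod(seg))
```

Running this on the example yields two splits: p1–p9 with 11,520 combinations and p10–p16 with 7,200, both comfortably under the 24,000 threshold, instead of one block of 82,944,000.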
-----------

I agree with you that testing the current method with more than one pair to measure its accuracy is the priority, and we are currently working on it.

-----------

As for an alternative to yasmet, I agree with spectei. Unfortunately, for now I don't have a solid idea to discuss, but in the next few days I will try to come up with one or more ideas.

On Fri, Apr 5, 2019 at 11:23 PM Francis Tyers <fty...@prompsit.com> wrote:

> El 2019-04-05 20:57, Sevilay Bayatlı escribió:
> > On Fri, 5 Apr 2019, 22:41 Francis Tyers, <fty...@prompsit.com> wrote:
> >
> >> El 2019-04-05 19:07, Sevilay Bayatlı escribió:
> >>> Hi Aboelhamd,
> >>>
> >>> There is some points in your proposal:
> >>>
> >>> First, I do not think "splitting sentence" is a good idea, each
> >>> language has different syntax, how could you know when you should
> >>> split the sentence.
> >>
> >> Apertium works on the concept of a stream of words, so in the
> >> runtime we can't really rely on robust sentence segmentation.
> >>
> >> We can often use it, e.g. for training, but if sentence boundary
> >> detection were to be included, it would need to be trained, as
> >> Sevilay hints at.
> >>
> >> Also, I'm not sure how much we would gain from that.
> >>
> >>> Second, "substitute yasmet with other method", I think the result
> >>> will not be more better if you substituted it with statistical method.
> >>
> >> Substituting yasmet with a more up to date machine-learning method
> >> might be a worthwhile thing to do. What suggestions do you have?
> >>
> >> I think first we have to trying the exact method with more than 3
> >> language pairs and then decide to substitute it or not, because
> >> what is the point of new method if dont achieve gain, then we can
> >> compare the results of two methods and choose the best one. What do
> >> you think?
> >
> > Yes, testing it with more language pairs is also a priority.
>
> Fran
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff