On 7/22/19 9:32 PM, Francis Tyers wrote:
> On 2019-07-21 22:50, Amr Mohamed Hosny Anwar wrote:
>> Dear Francis, Nick, Tommi,
>>
>> I hope this mail finds you well.
>> I would like to share with you the blog posts that I have used to
>> document the project's progress.
>> First, the scores for the implemented methods, computed using a
>> custom script
>> (https://github.com/apertium/lttoolbox/pull/55/files#diff-4791d142daa5e6d636af9488c64ef69a),
>> can be found here: https://ak-blog.herokuapp.com/posts/7/
>>
>> Second, I have done my best to find publications related to keywords
>> such as "morphological disambiguation". All the methods are
>> supervised in one way or another. I have documented my notes on the
>> relevant publications here: https://ak-blog.herokuapp.com/posts/9/
>>
>> Finally, I have made some tweaks to the supervised model and
>> implemented a model based on the lengths of the analyses. The model
>> seems to be equivalent to the one that assigns the same weight to
>> all the analyses, and I believe this is a result of the way the
>> lt-proc command works. You can check my explanation/findings here:
>> https://ak-blog.herokuapp.com/posts/10/
>>
>> I am looking forward to your advice on how to proceed with the
>> project. Additionally, do you think we can make use of a parallel
>> corpus for two languages in some way? I know a parallel corpus is
>> also a form of supervision, but my intuition is that
>> finding/developing parallel corpora is easier than
>> finding/developing a tagged corpus.
>>
>> Note: the blog is hosted on Heroku's free tier, so the first time
>> you access a page it might take some time to load :)
>>
>
> How about using BPE to weight the possible analyses?
>
> e.g.
>
> 1) BPE will give you a segmentation it likes for a word:
>    "arabasız>lar>da"
>
> 2) the analyser will give you various segmentations:
>    araba>sız>lar>da, arabasız>lar>da
>
> 3) you weight the segmentations that disagree with BPE higher, for
>    each boundary that isn't predicted by BPE
>
> F.

Hi Francis,
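If I understand steps 1-3 correctly, the scheme amounts to something
like the sketch below. This is only my reading of the proposal: the
helper names, the list-of-morphs representation, and the flat unit
penalty per unpredicted boundary are placeholders of mine, not anything
that exists in lttoolbox.

# A rough sketch: segmentations are lists of morphs, and each analyser
# boundary that BPE does not predict adds a flat penalty.

def boundaries(morphs):
    """Character offsets of the morph boundaries, e.g.
    ["araba", "sız", "lar", "da"] -> {5, 8, 11}."""
    offsets, pos = set(), 0
    for morph in morphs[:-1]:
        pos += len(morph)
        offsets.add(pos)
    return offsets

def weight(analyser_morphs, bpe_morphs, penalty=1.0):
    """Higher weight = less preferred: one penalty per analyser
    boundary that BPE did not predict."""
    extra = boundaries(analyser_morphs) - boundaries(bpe_morphs)
    return penalty * len(extra)

bpe = ["arabasız", "lar", "da"]          # 1) BPE's preferred split
candidates = [                           # 2) analyser's segmentations
    ["araba", "sız", "lar", "da"],
    ["arabasız", "lar", "da"],
]
for c in candidates:                     # 3) weight the disagreements
    print(">".join(c), weight(c, bpe))

So araba>sız>lar>da gets weight 1.0 (its boundary after "araba" is not
predicted by BPE) and arabasız>lar>da gets 0.0, meaning the analysis
that agrees with BPE is preferred.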
I have checked the BPE segmentation paper. The idea is easy to grasp,
but the morphological analyzers' output has a special format: to use
BPE, I would need to drop the analysis tags such as "<n>" and "<sg>".
To check whether BPE might be beneficial, I computed some statistics
from the tiny English corpus that I am using:

* Corpus size: 9098 tokens
* Unambiguous tokens (a unique analysis for the token): 5830 tokens (64%)
* Ambiguous tokens: 3268 tokens (36%)
* Tokens whose analyses have different segments: 533 tokens (5.86% of
  the corpus; 16.3% of the ambiguous tokens)

Example:

* Surface token: oscillating
* Analyses: oscillating<adj>/oscillate<vblex><pprs>/oscillate<vblex><subs>/oscillate<vblex><ger>
* Segments: oscillating/oscillate/oscillate/oscillate

Thus a BPE-based weighting could only discriminate between the analyses
for 5.86% of the tokens. Additionally, even among these tokens, most of
the analyses share the same segmentation (in the example above, three
of the four analyses yield "oscillate"). So I am not encouraged to use
BPE in this way, as I believe it won't make a big difference.

Do you think these statistics would differ for languages such as
German, Turkish, and Finnish, which seem to have more complex
compounding than English?
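P.S. For reference, the statistics above can be computed along the
following lines. This is a simplified sketch assuming bare analysis
strings (lemma plus tags, alternatives joined by "/"); it is not the
actual script from the pull request.

import re

TAG = re.compile(r"<[^>]+>")

def segments(analysis):
    """Drop the tags from each alternative analysis, e.g.
    "oscillating<adj>/oscillate<vblex><pprs>" ->
    ["oscillating", "oscillate"]."""
    return [TAG.sub("", alt) for alt in analysis.split("/")]

def stats(analyses):
    """Count total, unambiguous, ambiguous, and ambiguous tokens whose
    alternatives still differ once the tags are stripped."""
    total = unambiguous = differing = 0
    for analysis in analyses:
        total += 1
        if len(analysis.split("/")) == 1:
            unambiguous += 1
        elif len(set(segments(analysis))) > 1:
            differing += 1
    return total, unambiguous, total - unambiguous, differing

# The example token from above: ambiguous, and its segments differ
# ("oscillating" vs. "oscillate"), so it falls in the 5.86%.
print(segments("oscillating<adj>/oscillate<vblex><pprs>"
               "/oscillate<vblex><subs>/oscillate<vblex><ger>"))

Amr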