On 2019-07-24 14:34, Amr Mohamed Hosny Anwar wrote:
On 7/22/19 9:32 PM, Francis Tyers wrote:
On 2019-07-21 22:50, Amr Mohamed Hosny Anwar wrote:
Dear Francis, Nick, Tommi,

Hope this mail finds you well.
I would like to share with you the blog posts that I have used to
document the project's progress.
Firstly, the scores for the implemented methods, computed using a custom
script
(https://github.com/apertium/lttoolbox/pull/55/files#diff-4791d142daa5e6d636af9488c64ef69a),
can be found here: https://ak-blog.herokuapp.com/posts/7/

Secondly, I have done my best to search for relevant publications
related to keywords such as "morphological disambiguation".
All the methods are supervised in one way or another.
I have documented my notes for the list of relevant publications here:
https://ak-blog.herokuapp.com/posts/9/

Finally, I have made some tweaks to the supervised model and implemented
a model based on the length of each analysis.
That model seems to be equivalent to the one that assigns the same
weight to all the analyses, and I believe this is a result of the way
the lt-proc command works.
You can check my explanation/findings here:
https://ak-blog.herokuapp.com/posts/10/
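
Roughly, what I mean by a length-based model is something like the
following (a minimal, illustrative Python sketch only; the actual
scoring in my branch may differ):

# Illustrative only: weight each analysis of a token by its own length.
def length_weights(analyses):
    """Weight each analysis by its length (longer analysis -> larger weight)."""
    return {a: float(len(a)) for a in analyses}

print(length_weights(["oscillating<adj>", "oscillate<vblex><ger>"]))
# {'oscillating<adj>': 16.0, 'oscillate<vblex><ger>': 21.0}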

Looking forward to reading your advice on how to proceed with the
project.
Additionally, do you think we can make use of a parallel corpus for two
languages in some way or another?
I know a parallel corpus is also a form of supervision, but my intuition
is that finding/developing parallel corpora is easier than
finding/developing a tagged corpus.

Note: the blog is hosted on Heroku's free tier, so the first time you
access a page it might take some time to load :)


How about using BPE to weight the possible analyses?

e.g.

1) BPE will give you a segmentation it likes for a word,
"arabasız>lar>da"

2) analyser will give you various segmentations:
     araba>sız>lar>da, arabasız>lar>da

3) you weight higher the segmentations that disagree with BPE, adding to
the weight for each boundary that isn't predicted by BPE (see the
sketch below)
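
Something along these lines, as a rough sketch (the function names and
the penalty value are made up, not anything that already exists in
lttoolbox):

# Sketch: penalise analyser segmentations by how much they disagree with BPE.
# A boundary is its character offset in the surface form, e.g.
# "araba>sız>lar>da" -> {5, 8, 11} for the surface "arabasızlarda".

def boundaries(segmentation):
    """Character offsets of the '>' boundaries in a segmented form."""
    offsets, pos = set(), 0
    for segment in segmentation.split(">"):
        pos += len(segment)
        offsets.add(pos)
    offsets.discard(pos)  # the end of the word is not a boundary
    return offsets

def bpe_weight(analysis_seg, bpe_seg, penalty=1.0):
    """Add `penalty` for every boundary in the analysis that BPE does not predict."""
    extra = boundaries(analysis_seg) - boundaries(bpe_seg)
    return penalty * len(extra)

bpe = "arabasız>lar>da"
for seg in ["araba>sız>lar>da", "arabasız>lar>da"]:
    print(seg, bpe_weight(seg, bpe))
# araba>sız>lar>da gets weight 1.0, arabasız>lar>da gets 0.0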


F.

Hi Francis,

I have checked the BPE segmentation paper.

The idea is easy to grasp, but a morphological analyser's output has a
special format: to use BPE I will need to drop the analysis tags such as
"<n>" and "<sg>".
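For example, something like this (a minimal sketch; the regex and the
helper names are only illustrative, not taken from the lttoolbox code):

import re

# Drop Apertium-style tags such as <n> or <sg> from an analysis string,
# keeping only the lemma/segments, so the result can be fed to BPE.
TAG = re.compile(r"<[^>]*>")

def segments(analysis):
    """e.g. 'oscillate<vblex><ger>' -> 'oscillate'"""
    return TAG.sub("", analysis)

def distinct_segments(analyses):
    """The set of tag-free forms over all analyses of one token."""
    return {segments(a) for a in analyses}

analyses = ("oscillating<adj>/oscillate<vblex><pprs>/"
            "oscillate<vblex><subs>/oscillate<vblex><ger>").split("/")
print(distinct_segments(analyses))  # {'oscillating', 'oscillate'}

(A token counts towards the "different segments" figure below when this
set has more than one element.)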
To check whether BPE might be beneficial, I computed some statistics on
the tiny English corpus that I am using.

* Corpus size: 9098 tokens.
* Unambiguous tokens (a single analysis per token): 5830 tokens (64%)
* Ambiguous tokens: 3268 tokens (36%)
    * Tokens whose analyses have different segments: 533 tokens
      (5.86% of the corpus, 16.3% of the ambiguous tokens)
      Example:
        * Surface token: oscillating
        * Analyses:
          oscillating<adj>/oscillate<vblex><pprs>/oscillate<vblex><subs>/oscillate<vblex><ger>
        * Segments: oscillating/oscillate/oscillate/oscillate

Well, here the segments would be:

oscillating
oscillat>ing

Thus BPE will generate different segments for only 5.86% of the tokens.

It's not a good idea to try this on English, which has fairly impoverished morphology.

Additionally, most of these tokens will have the same segmentation for
their respective analyses.

This is true: multiple analyses with the same segmentation would not get
any benefit from BPE. But you could potentially do BPE + something else,
e.g. use BPE to set initial weights, then "Do Something Clever"[tm].

I am not very encouraged to use BPE in this way, as I believe it won't
make a big difference.
Do you think these statistics will differ for languages such as German,
Turkish, or Finnish, which seem to have more complex compounding than
English?


Yes, they could be very different. I think Tommi has had some experience
integrating Morfessor (a system similar to BPE) with transducers.

Tommi, could you let us know what you tried?

F.


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
