Dear Francis, Tommi, Nick, and fellow Apertiumers,

Hope this mail finds you well.
I would like to arrange a meeting with you and any interested lttoolbox/apertium maintainers to set a plan for merging the project's code into the master branch. One thing we will need to figure out is whether we should restrict ourselves to (non-bash) shell scripts: I have used plain Python scripts for evaluation and in some of the weighting algorithms, such as the supervised weighting method (https://github.com/apertium/lttoolbox/blob/19abb2121b89dee88f98472da34fc82ad9090c05/scripts/annotated_corpus_to_weightlist.py).

Regarding the project's progress, I have finally arrived at a proper experiment that uses a word2vec CBOW model to estimate weights for an FST. The implementation works as follows (a rough code sketch is appended at the end of this mail):

1) Train a word2vec model with gensim (I used text8, a 100 MB toy dataset made up of a subset of Wikipedia's articles).
2) For each ambiguous word (one that has more than one analysis):
   2.1) Collect its context words (neighbouring words at distance <= window_size, which is 2 in this experiment).
   2.2) Use the CBOW model to predict the 10 most probable words given those context words.
   2.3) Look up the analyses of the predicted words and drop the ambiguous ones.
   2.4) Compare the tags of the original word's analyses with the tags of the unambiguous predicted words.
   2.5) Count how many times each tag of the ambiguous word matches that of an unambiguous predicted word.

The bottleneck is step 2.2; from a quick look at some gensim pull requests, this seems to be due to the way the function (Word2Vec.predict_output_word) is implemented.

The evaluation dataset was a concatenation of the text files in apertium-eng/texts/old/*.raw.txt (https://github.com/apertium/apertium-eng/tree/master/texts/old). The model's precision is 0.69530 and its recall is 0.69918, which is about 1% better than the naive equally-probable weighting baseline (precision 0.68683, recall 0.68682).

I have also tweaked the lt-weight script to support multiple weightlists: the first weightlist is used to produce a WFST, the second then acts as a fallback for the analyses that are still unweighted, and so on. This will be beneficial in two cases (toy illustrations of both are appended below as well):

1) Laplace smoothing, where the default weight for unseen analyses depends on the corpus size, which is unknown to the lt-weight script.
2) Layered weightlists, where the first list holds fully specified weighted regexes (no wildcards), e.g. cat<n><pl>::100, and the second holds more generic ones, e.g. ?* <n><pl>::1000.

Finally, I have the following items on my to-do list:

1) Test the scripts on Breton and Kazakh.
2) Rebase the commits and make sure I have committed all the changes to the branch so that it is ready for review.

Looking forward to the upcoming weeks :D

Thanks,
Amr
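
P.S. For anyone who prefers code to prose, below is a rough, self-contained sketch of the weighting loop from steps 2.1-2.5. Only the gensim calls (Text8Corpus, Word2Vec with sg=0 for CBOW, predict_output_word) reflect what I actually use; the analyses() helper and the (lemma, tags) pairs are stand-ins for the lttoolbox glue code, so please read it as an outline rather than the real script.

    from collections import Counter
    from gensim.models.word2vec import Word2Vec, Text8Corpus

    WINDOW = 2  # context window used in the experiment

    # 1) train a CBOW model (sg=0) on the text8 toy corpus
    model = Word2Vec(Text8Corpus("text8"), sg=0, window=WINDOW)

    def vote_for_analyses(sentence, i, analyses):
        """analyses(word) is assumed to return a list of (lemma, tags) pairs
        from the lttoolbox analyser; the result is a vote count per analysis
        of the ambiguous word sentence[i]."""
        # 2.1) context words at distance <= WINDOW
        context = sentence[max(0, i - WINDOW):i] + sentence[i + 1:i + 1 + WINDOW]
        # 2.2) the 10 most probable words given that context (the slow step)
        predicted = model.predict_output_word(context, topn=10) or []
        votes = Counter()
        for word, _prob in predicted:
            candidates = analyses(word)
            if len(candidates) != 1:
                continue  # 2.3) keep only unambiguous predicted words
            _, tags = candidates[0]
            for lemma, own_tags in analyses(sentence[i]):
                if own_tags == tags:                # 2.4) tags match the unambiguous neighbour
                    votes[(lemma, own_tags)] += 1   # 2.5) one more vote for this analysis
        return votes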
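
The multi-weightlist behaviour of lt-weight can also be shown with a toy snippet. The real script of course works on the compiled transducer, not on Python lists; this only demonstrates the intended precedence, and the pattern handling ('?*' meaning "any string", everything else matched as an exact analysis) is a deliberate simplification.

    def parse_weightlist(lines):
        """Lines look like 'cat<n><pl>::100' or '?* <n><pl>::1000'."""
        entries = []
        for line in lines:
            pattern, weight = line.rsplit("::", 1)
            entries.append((pattern.strip(), float(weight)))
        return entries

    def weight_for(analysis, weightlists):
        """Return the weight from the first weightlist whose entry matches;
        later weightlists only act as fallbacks for still-unweighted analyses."""
        for weightlist in weightlists:
            for pattern, weight in weightlist:
                if pattern.startswith("?*"):
                    # generic entry: match any analysis ending in the given tags
                    if analysis.endswith(pattern[2:].strip()):
                        return weight
                elif analysis == pattern:  # fully specified entry
                    return weight
        return None  # still unweighted: lt-weight's default applies

    specific = parse_weightlist(["cat<n><pl>::100"])
    generic = parse_weightlist(["?* <n><pl>::1000"])

    print(weight_for("cat<n><pl>", [specific, generic]))  # 100.0  (first list wins)
    print(weight_for("dog<n><pl>", [specific, generic]))  # 1000.0 (falls back to the generic list)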
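
And to make the Laplace-smoothing case concrete: if the weights are taken to be negative log probabilities (an assumption for this illustration), add-one smoothing over the analysis counts gives every unseen analysis the same default weight, and that weight depends on corpus totals that only the corpus-processing script sees, never lt-weight itself. Something along these lines:

    import math

    def laplace_weights(counts, num_analyses):
        """counts: occurrences of each analysis in the annotated corpus;
        num_analyses: size of the analysis 'vocabulary' used for add-one smoothing."""
        total = sum(counts.values())
        seen = {a: -math.log((c + 1) / (total + num_analyses))
                for a, c in counts.items()}
        # every analysis unseen in the corpus gets this same weight,
        # and it depends on `total`, which lt-weight never sees
        default = -math.log(1 / (total + num_analyses))
        return seen, default

    seen, default = laplace_weights({"cat<n><pl>": 42, "cat<vblex><pres>": 3},
                                    num_analyses=1000)
    # `seen` would populate the first weightlist (one entry per analysis);
    # `default` would become the catch-all entry in the fallback weightlist.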