Dear Francis, Tommi, Nick, and fellow Apertiumers,

Hope this mail finds you well.
I would like to arrange a meeting with you and any interested lttoolbox/apertium maintainers to set a plan for merging the project's code into the master branch. One thing we will need to figure out is whether we should restrict ourselves to (non-bash) shell scripts: I have used plain Python scripts for evaluation and in some of the weighting algorithms, such as the supervised weighting method (https://github.com/apertium/lttoolbox/blob/19abb2121b89dee88f98472da34fc82ad9090c05/scripts/annotated_corpus_to_weightlist.py).

Regarding the project's progress, I have finally arrived at a proper experiment that uses a word2vec CBOW model to estimate weights for an FST. The implementation works as follows (a rough code sketch is appended at the end of this mail):

1) Train a word2vec model with gensim (I used text8, a 100 MB toy dataset made up of a subset of Wikipedia's articles).
2) For each ambiguous word (one that has more than one analysis):
   2.1) Collect its context words (neighbouring words at distance <= window_size, which is 2 in this experiment).
   2.2) Use the CBOW model to predict the 10 most probable words given those context words.
   2.3) Look up the analyses of the predicted words and drop the ambiguous ones.
   2.4) Compare the tags of the original word's analyses with the tags of the unambiguous predicted words.
   2.5) Count how many times each tag of the ambiguous word matches that of an unambiguous predicted word.

The bottleneck is step 2.2; from a quick look at some gensim pull requests, this seems to be due to the way the function (Word2Vec.predict_output_word) is implemented.

The evaluation dataset was a concatenation of the text files in apertium-eng/texts/old/*.raw.txt (https://github.com/apertium/apertium-eng/tree/master/texts/old). The model's precision is 0.69530 and its recall is 0.69918, which is about 1% better than the naive equally-probable weighting baseline (precision 0.68683, recall 0.68682).

I have also tweaked the lt-weight script to support multiple weightlists: the first weightlist is used to produce a WFST, the second then acts as a fallback for the analyses that are still unweighted, and so on. This will be beneficial in two cases (toy illustrations of both are appended below as well):

1) Laplace smoothing, where the default weight for unseen analyses depends on the corpus size, which is unknown to the lt-weight script.
2) Layered weightlists, where the first list holds fully specified weighted regexes (no wildcards), e.g. cat<n><pl>::100, and the second holds more generic ones, e.g. ?* <n><pl>::1000.

Finally, I have the following items on my to-do list:

1) Test the scripts on Breton and Kazakh.
2) Rebase the commits and make sure I have committed all the changes to the branch so that it is ready for review.

Looking forward to the upcoming weeks :D

Thanks,
Amr
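
P.S. For anyone who prefers code to prose, below is a rough, self-contained sketch of the weighting loop from steps 2.1-2.5. Only the gensim calls (Text8Corpus, Word2Vec with sg=0 for CBOW, predict_output_word) reflect what I actually use; the analyses() helper and the (lemma, tags) pairs are stand-ins for the lttoolbox glue code, so please read it as an outline rather than the real script.

    from collections import Counter
    from gensim.models.word2vec import Word2Vec, Text8Corpus

    WINDOW = 2  # context window used in the experiment

    # 1) train a CBOW model (sg=0) on the text8 toy corpus
    model = Word2Vec(Text8Corpus("text8"), sg=0, window=WINDOW)

    def vote_for_analyses(sentence, i, analyses):
        """analyses(word) is assumed to return a list of (lemma, tags) pairs
        from the lttoolbox analyser; the result is a vote count per analysis
        of the ambiguous word sentence[i]."""
        # 2.1) context words at distance <= WINDOW
        context = sentence[max(0, i - WINDOW):i] + sentence[i + 1:i + 1 + WINDOW]
        # 2.2) the 10 most probable words given that context (the slow step)
        predicted = model.predict_output_word(context, topn=10) or []
        votes = Counter()
        for word, _prob in predicted:
            candidates = analyses(word)
            if len(candidates) != 1:
                continue  # 2.3) keep only unambiguous predicted words
            _, tags = candidates[0]
            for lemma, own_tags in analyses(sentence[i]):
                if own_tags == tags:                # 2.4) tags match the unambiguous neighbour
                    votes[(lemma, own_tags)] += 1   # 2.5) one more vote for this analysis
        return votes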
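
The multi-weightlist behaviour of lt-weight can also be shown with a toy snippet. The real script of course works on the compiled transducer, not on Python lists; this only demonstrates the intended precedence, and the pattern handling ('?*' meaning "any string", everything else matched as an exact analysis) is a deliberate simplification.

    def parse_weightlist(lines):
        """Lines look like 'cat<n><pl>::100' or '?* <n><pl>::1000'."""
        entries = []
        for line in lines:
            pattern, weight = line.rsplit("::", 1)
            entries.append((pattern.strip(), float(weight)))
        return entries

    def weight_for(analysis, weightlists):
        """Return the weight from the first weightlist whose entry matches;
        later weightlists only act as fallbacks for still-unweighted analyses."""
        for weightlist in weightlists:
            for pattern, weight in weightlist:
                if pattern.startswith("?*"):
                    # generic entry: match any analysis ending in the given tags
                    if analysis.endswith(pattern[2:].strip()):
                        return weight
                elif analysis == pattern:  # fully specified entry
                    return weight
        return None  # still unweighted: lt-weight's default applies

    specific = parse_weightlist(["cat<n><pl>::100"])
    generic = parse_weightlist(["?* <n><pl>::1000"])

    print(weight_for("cat<n><pl>", [specific, generic]))  # 100.0  (first list wins)
    print(weight_for("dog<n><pl>", [specific, generic]))  # 1000.0 (falls back to the generic list)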
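
And to make the Laplace-smoothing case concrete: if the weights are taken to be negative log probabilities (an assumption for this illustration), add-one smoothing over the analysis counts gives every unseen analysis the same default weight, and that weight depends on corpus totals that only the corpus-processing script sees, never lt-weight itself. Something along these lines:

    import math

    def laplace_weights(counts, num_analyses):
        """counts: occurrences of each analysis in the annotated corpus;
        num_analyses: size of the analysis 'vocabulary' used for add-one smoothing."""
        total = sum(counts.values())
        seen = {a: -math.log((c + 1) / (total + num_analyses))
                for a, c in counts.items()}
        # every analysis unseen in the corpus gets this same weight,
        # and it depends on `total`, which lt-weight never sees
        default = -math.log(1 / (total + num_analyses))
        return seen, default

    seen, default = laplace_weights({"cat<n><pl>": 42, "cat<vblex><pres>": 3},
                                    num_analyses=1000)
    # `seen` would populate the first weightlist (one entry per analysis);
    # `default` would become the catch-all entry in the fallback weightlist.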