Hi Francis, Sevilay, and all Apertium mentors and contributors, I hope you are all safe and sound.
This is an update on the progress of the scikit-learn SVM model.

After some long research on using an SVM model instead of maximum entropy, it turned out there was a misunderstanding about the dataset we generate, which unfortunately slowed my research and took it further afield than it should have gone. The problem was the weights (fractional counts) attached to each sample: I couldn't find a clear answer to the question "how do I include these weights in an SVM, and how does that really work?", and my search led me into papers proposing new SVM models that solve this problem, but the papers were either off-topic or not clear enough to try to re-implement their idea. Fortunately, I finally found the solution right under my nose in the scikit-learn library: several classification models accept weighted samples in training and testing, namely "NaiveBayes", "SVM", "DecisionTree", "RandomForest", and "AdaBoost".

1. Yasmet data format:

0 $ 0.335668 # debate_0:0 monográfico_1:0 importante_2:0 # debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2 importante_2:2 #
1 $ 0.329582 # debate_0:0 monográfico_1:0 importante_2:0 # debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2 importante_2:2 #
2 $ 0.33475 # debate_0:0 monográfico_1:0 importante_2:0 # debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2 importante_2:2 #

Sklearn data format:

0 0.335668 debate monográfico importante
1 0.329582 debate monográfico importante
2 0.33475 debate monográfico importante

2. Each dataset is generated for one ambiguity. For example, the data samples above are generated from the ambiguous pattern "NOM ADJ ADJ", with the rules:

rule-15 (NOM ADJ ADJ) : class 0
rule-14 (NOM ADJ), rule-32 (ADJ) : class 1
rule-1 (NOM), rule-32 (ADJ), rule-32 (ADJ) : class 2

3. The sklearn dataset is loaded into a pandas data-frame, where the rule/class is the *target*, the fractional count is the *sample weight*, and the pattern words are the *features*.

4.
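To make the data-frame loading in point 3 concrete, here is a minimal sketch. The column names (target, weight, word_i) are my own for illustration, not the module's actual naming; the records are the sklearn-format samples shown above.

```python
import io
import pandas as pd

# Sklearn-format records: class, fractional count, then the pattern words.
raw = """0 0.335668 debate monográfico importante
1 0.329582 debate monográfico importante
2 0.33475 debate monográfico importante"""

rows = []
for line in io.StringIO(raw):
    parts = line.split()
    rows.append({
        "target": int(parts[0]),    # rule/class id
        "weight": float(parts[1]),  # fractional count, used as sample weight
        **{f"word_{i}": w for i, w in enumerate(parts[2:])},
    })

df = pd.DataFrame(rows)
print(df)
```

The fractional counts of one ambiguity sum to 1, so the weights column can be fed directly to a classifier's sample_weight argument.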
Features are encoded with a simple encoder called *OrdinalEncoder*. It just encodes words as numbers from 0 to n-1, where n is the number of unique words in the training dataset. The encoder works with each feature separately: the first feature is encoded from 0 to (n1 - 1), where n1 is the number of unique words in that feature; the second feature is encoded from 0 to (n2 - 1), where n2 is the number of unique words in that feature; and so on. For example, suppose we have 3 features, each with only 2 unique words:

debate monográfico importante
competitividad político económico

The per-feature encoding will be:

feature 1: debate=0, competitividad=1
feature 2: monográfico=0, político=1
feature 3: importante=0, económico=1

And this is a drawback: if we get a test sample "competitividad económico importante", the encoder will raise an error because it never saw *económico* in the *second* feature during training, even though it's an *adjective*. So a solution may be to use another encoder, or to modify how the *OrdinalEncoder* is used: if we combine the features instead of keeping them separate, the words would be encoded as follows:

debate=0 monográfico=2 importante=4
competitividad=1 político=3 económico=5

This is still just encoding a string as a number, but there are other encoders such as *OneHotEncoder*, and yet others not included in sklearn, which could capture useful semantics in their representation. *Do you recommend any specific encoder?*

5. The SVM model's accuracy is pretty similar to that of the max-entropy model, which in turn is similar to a *random* model: the accuracy is approximately (1/n)*100%, where n is the number of classes. For example, a dataset with 2 classes gets about 50% accuracy, a dataset with 5 classes about 20%, and so on. Why this behaviour? I think it may be because of the language-model scores, as they are very close to each other. What is the solution?
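A small sketch of the drawback and the combined-vocabulary workaround described in point 4 (the shared mapping below is built by hand, just to illustrate the idea):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X_train = np.array([
    ["debate", "monográfico", "importante"],
    ["competitividad", "político", "económico"],
])

# Default per-column behaviour: a word seen only in the third column
# is an unknown category in the second one, so transform() fails.
enc = OrdinalEncoder()
enc.fit(X_train)
raised = False
try:
    enc.transform([["competitividad", "económico", "importante"]])
except ValueError:
    raised = True  # 'económico' was never seen in the second feature

# Workaround: one shared vocabulary across all features.
vocab = {w: i for i, w in enumerate(
    ["debate", "competitividad", "monográfico", "político",
     "importante", "económico"])}

def encode(row):
    return [vocab[w] for w in row]

print(raised, encode(["competitividad", "económico", "importante"]))
```

As a side note, newer scikit-learn releases (0.24+) also let you pass handle_unknown="use_encoded_value" with an unknown_value to OrdinalEncoder, which avoids the error without merging vocabularies.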
I really don't know whether a language model other than the n-gram model could improve the results, or whether a better encoder could. I also have an idea here, though I don't know if it's right: use WER, PER, BLEU, or a combination of them, instead of the language-model score, for classification, so that it becomes our new sample weight. *Do you have any thoughts here?*

6. Some datasets are very large, with hundreds of thousands or even millions of records, which makes training them with the sklearn SVM model impractical. I tried to train a dataset with 800k records and it took about 30 hours on my PC; we have 7 datasets with more than 1 million records, and the largest dataset has more than 14 million records, which would take about 21 days to train (by simple interpolation; in reality it would take multiples of that). I had some options to choose from:

- Work with a C/C++ SVM library instead of the Python sklearn library. But since I have already written scripts and a program to integrate sklearn with our module, re-implementing all of that is not the easiest solution.
- Run the training on Google Colab with GPU acceleration. I tried this; it was faster by about 30%, but not enough, because Colab has a limit of 8 consecutive hours of runtime, while the 800k-record dataset would still need about 20 hours.
- Set a maximum threshold on the number of records. By experimenting, I found that a 200k limit is a good choice: it's neither too large nor too small, and training takes about 1.5 hours.

I chose the last solution, so that I could test all the scripts, the program, and the sklearn integration with our module, and because only about 15 of the 230 datasets are affected by the threshold. It all works well now. *What do you recommend here: should I continue with the threshold solution, or are there other options?*

7.
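On the size problem in point 6, one more possibility might be a linear SVM trained with SGD, which also accepts per-sample weights and scales roughly linearly with the number of records, unlike the kernel SVC. A sketch with random stand-in data (not our real datasets):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(300, 3)).astype(float)  # encoded word features
y = rng.integers(0, 3, size=300)                     # rule/class ids
w = rng.random(300)                                  # fractional counts

# Kernel SVM with per-sample weights: fine for small datasets, but
# training time grows much faster than linearly in the record count.
svc = SVC(kernel="rbf", C=1.0).fit(X, y, sample_weight=w)

# Linear SVM via SGD (hinge loss): takes the same weights and can be
# trained incrementally with partial_fit, batch by batch.
sgd = SGDClassifier(loss="hinge")
sgd.partial_fit(X, y, classes=np.arange(3), sample_weight=w)

print(svc.predict(X[:2]), sgd.predict(X[:2]))
```

Since partial_fit works on batches, the 14M-record dataset could be streamed from disk instead of being truncated at 200k, at the cost of losing non-linear kernels.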
After the encoder and dataset-size issues are settled, we can tune the SVM hyper-parameters, such as the penalty rate (the *C* parameter) and the *kernel*, to achieve better accuracy.

As I was sick for more than 2 weeks, I will add 2-3 weeks of work after the end of GSoC. This is the list of tasks to be done before 15 September:

1. Finish the main task of the GSoC idea, which is extending the module to interchunk and postchunk. Two days ago I started re-reading the apertium2 documentation to get familiar again with the transfer-rule files. But there is still a problem with ambiguous transfer rules in .t2x and .t3x files: I don't know whether there are enough ambiguous rules (if there are any at all) to test the module once the implementation is finished. *Could you provide me with updates on that?*
2. Write documentation for the usage of all programs and scripts.
3. Refactor and integrate the module into the Apertium pipeline, as I didn't build on the refactored version you asked me to integrate before GSoC. (Though I think this will be hard to finish before 15 September.)

*Do you have any thoughts on what to do next?*
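For the hyper-parameter tuning mentioned in point 7, a minimal sketch using scikit-learn's GridSearchCV over C and the kernel, with the fractional counts passed through as sample weights (random stand-in data; the grid values are arbitrary):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(120, 3)).astype(float)  # encoded word features
y = rng.integers(0, 3, size=120)                     # rule/class ids
w = rng.random(120)                                  # fractional counts

# Grid over the two hyper-parameters mentioned above: penalty C and kernel.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y, sample_weight=w)  # fit params are forwarded to SVC.fit

print(search.best_params_)
```

Recent scikit-learn versions slice fit parameters of length n_samples per CV fold, so the weights follow the data splits; with older versions the weights may need to be handled manually.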
_______________________________________________ Apertium-stuff mailing list Apertium-stuff@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/apertium-stuff