Hi Francis, Sevilay, and all apertium mentors and contributors,

I hope all of you are safe and sound.

This is an update on the progress of the sklearn SVM model:

After some long research on using an SVM model instead of max entropy, there was a misunderstanding of the dataset we generate, which unfortunately slowed my research and took it in a more distant direction than it should have gone. The problem was in the weights (fractional counts) for each sample, as I couldn't find a clear answer to my question of how to include these weights in an SVM and how that really works, and my search took me into reading some papers about newly proposed SVM models that solve this problem, but the papers were either off-topic or not clear enough to try to re-implement their ideas.
Fortunately, I finally found the solution right under my feet in the scikit-learn library, as some classification models are given the option of accepting weighted samples in training and testing; these models are "NaiveBayes", "SVM", "DecisionTree", "RandomForest", and "AdaBoost".

    1. Yasmet data format:
         0 $ 0.335668 # debate_0:0 monográfico_1:0 importante_2:0 # debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2 importante_2:2 #
         1 $ 0.329582 # debate_0:0 monográfico_1:0 importante_2:0 # debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2 importante_2:2 #
         2 $ 0.33475 # debate_0:0 monográfico_1:0 importante_2:0 # debate_0:1 monográfico_1:1 importante_2:1 # debate_0:2 monográfico_1:2 importante_2:2 #

        Sklearn data format:
        0  0.335668  debate  monográfico  importante
        1  0.329582  debate  monográfico  importante
        2  0.33475   debate  monográfico  importante
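        As an illustration only (assuming the Yasmet layout "class $ weight # word_pos:alt ... #" shown above), one record could be converted to the sklearn row format like this:

    # Hypothetical converter, not the project's actual script.
    def yasmet_to_sklearn_row(line):
        head, _, rest = line.partition('#')
        label, _, weight = head.split()            # e.g. "0", "$", "0.335668"
        first_block = rest.split('#')[0].split()   # features of the first alternative
        words = [f.split('_')[0] for f in first_block]
        return [int(label), float(weight)] + words

    line = ("0 $ 0.335668 # debate_0:0 monográfico_1:0 importante_2:0 "
            "# debate_0:1 monográfico_1:1 importante_2:1 "
            "# debate_0:2 monográfico_1:2 importante_2:2 #")
    print(yasmet_to_sklearn_row(line))
    # -> [0, 0.335668, 'debate', 'monográfico', 'importante']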


    2. Each dataset is generated for one ambiguity. For example, the data sample above is generated from the ambiguous pattern "NOM ADJ ADJ", with the rules:
        class 0:  rule-15 (NOM ADJ ADJ)
        class 1:  rule-14 (NOM ADJ), rule-32 (ADJ)
        class 2:  rule-1 (NOM), rule-32 (ADJ), rule-32 (ADJ)


    3. The sklearn dataset is loaded into a pandas data frame, where the rule combination/class is the *target*, the fractional count is the *sample weight*, and the pattern words are the *features*.
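        A minimal sketch of this step (the file name and column names are placeholders, not the ones used by our scripts):

    import pandas as pd

    cols = ["target", "weight", "word_0", "word_1", "word_2"]
    df = pd.read_csv("nom_adj_adj.data", sep=r"\s+", header=None, names=cols)

    y = df["target"]                        # rule combination / class
    sample_weight = df["weight"]            # fractional count
    X = df[["word_0", "word_1", "word_2"]]  # pattern words as features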


    4. Features are encoded with a simple encoder called *OrdinalEncoder*. It just encodes words as numbers from *0* to *n-1*, where *n* is the number of unique words in the training dataset. The encoder works with each feature separately; that is, the first feature is encoded from 0 to (*n1* - 1), where *n1* is the number of unique words in that feature, and similarly the second feature is encoded from 0 to (*n2* - 1), where *n2* is the number of unique words in that feature, and so on.
        For example, if we have 3 features, each with only 2 unique words, as follows:
              debate monográfico importante
              competitividad político económico

       The encoding will be done per feature, for example:
             debate=0  monográfico=0  importante=0
             competitividad=1  político=1  económico=1

      And this is a drawback: if we have a test sample "competitividad *económico* importante", the encoder will raise an error because it didn't see *económico* before in the *second* feature, even though it's an *adjective*. So a solution may be to use another encoder or just modify the use of the *OrdinalEncoder* so that we combine the features instead of having them separated; it would then be encoded as follows:
            debate=0  monográfico=2  importante=4
            competitividad=1  político=3  económico=5

      This was just encoding a string as a number, but there are other encoders such as *OneHotEncoder*, or yet other encoders not included in sklearn, which could capture useful semantics in their representation.

*      Do you recommend any specific encoder?*
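      One possible workaround, sketched below, is the handle_unknown option of OrdinalEncoder (available in scikit-learn >= 0.24), which maps unseen words to a reserved code instead of raising an error:

    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder

    train = np.array([["debate", "monográfico", "importante"],
                      ["competitividad", "político", "económico"]])

    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    enc.fit(train)

    test = np.array([["competitividad", "económico", "importante"]])
    print(enc.transform(test))  # "económico" unseen in the second feature -> -1, no error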


    5. SVM model accuracy is pretty similar to that of the max entropy model, which in turn is similar to a *random* model. The accuracy approximately equals (1/*n*) * 100%, where *n* is the number of classes. For example, if a dataset has 2 classes the accuracy equals 50%, and if a dataset has 5 classes the accuracy equals 20%, and so on. Why this behavior? I think it may be because of the language model scores, as they are very close to each other. What is the solution? I really don't know whether another language model other than the n-gram could improve the results, *or* maybe a better encoder could improve the results.

        I have an idea here, but I don't know whether it's right or not: use WER, PER, BLEU, or a combination of them, instead of the language model score for classification, so that it becomes our new sample weight.

        *Do you have any thoughts here?*
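        To make the idea concrete, this is a rough sketch (my own illustration, not an agreed design) of scoring a candidate translation against a reference with sentence-level BLEU from NLTK and using that score as the sample weight:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def bleu_weight(candidate, reference):
        # Smoothing avoids zero scores on short sentences.
        smooth = SmoothingFunction().method1
        return sentence_bleu([reference.split()], candidate.split(),
                             smoothing_function=smooth)

    print(bleu_weight("an important monographic debate",
                      "an important monographic debate today"))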


    6. Some datasets are very large, with hundreds of thousands or even millions of records, which makes training them with the sklearn SVM model impractical. I tried to train a dataset with *800k* records and it took about *30* hours on my PC; we have *7* datasets with more than *1* million records, and the largest dataset has more than *14* million records, which would take about *21* days to train (by simple interpolation; actually it would take multiples of that).

        I had some options to choose from:
          - To work with a *c/c++* SVM library other than the *python* sklearn library. But since I have already written scripts and a program to integrate sklearn with our module, re-implementing all of these is not the easiest solution.

          - To run the training on google *colab* with *GPU* acceleration. I tried this solution, and it was faster by about *30%*, but this was not enough because colab has a running limit of *8* consecutive hours, so it would still not be enough for the *800k*-record dataset, which would now run in about *20* hours.

          - To set a maximum threshold on the number of records. By experimenting, I found that a *200k* limit is a good choice: it's neither too large nor too small, and it takes about *1.5* hours to train.

           I chose this last solution, to be able to test all the scripts, the program, and the sklearn integration with our module, and because only about 15 datasets out of 230 are affected by the threshold. And now it all works well.

        *What do you recommend here: should I continue with the threshold solution, or are there other solutions?*
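        For reference, this is a minimal sketch of the 200k cap as I apply it now (whether to truncate or subsample is my own choice here, not a settled decision):

    MAX_RECORDS = 200_000

    def cap_dataset(df, max_records=MAX_RECORDS, seed=42):
        # Randomly subsample over-sized datasets (pandas DataFrames, as in
        # the sketch under point 3) down to the threshold so that SVM
        # training stays practical.
        if len(df) <= max_records:
            return df
        return df.sample(n=max_records, random_state=seed)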


   7. After finishing the encoder and dataset-size issues, we can then tune the SVM hyper-parameters, such as the penalty rate (*C* parameter) and the *kernel*, to achieve better accuracy.
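       A minimal sketch of that tuning step with GridSearchCV (the grid values and the dummy data are placeholders, not agreed choices):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Dummy stand-ins for the encoded features, classes, and fractional counts.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 6, size=(60, 3))
    y = rng.integers(0, 3, size=60)
    w = rng.random(60)

    param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf", "poly"]}
    search = GridSearchCV(SVC(), param_grid, cv=3)
    search.fit(X, y, sample_weight=w)   # weights are used when fitting each candidate
    print(search.best_params_, search.best_score_)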



As I was sick for more than 2 weeks, I will add 2-3 weeks of work after the end of GSoC. So this is the list of tasks to be done before September 15:

   1. Finishing the main task of the GSoC idea, which is extending the module to interchunk and postchunk. I already began two days ago reading the apertium2 documentation to refresh my memory and get familiar again with the transfer rules files. But there is still a problem with ambiguous transfer rules in .t2x and .t3x files: I don't know whether there are enough ambiguous rules, if any, to test the module with after finishing the implementation.

       *So could you provide me with updates on that?*


   2. Writing documentation for the usage of all programs and scripts.


   3. Refactoring and integrating the module into the apertium pipeline, as I didn't build on the refactored version you asked me to integrate before GSoC.
       (Though I think this is hard to finish before September 15.)


*   Do you have any thoughts on what to do next?*