Github user acidghost commented on the pull request:

    https://github.com/apache/spark/pull/6761#issuecomment-114978814
  
    @srowen So I wasn't able to find anything more about e1071, so I opted for scikit-learn, which provides both multinomial and Bernoulli models ([link](http://scikit-learn.org/stable/modules/naive_bayes.html)).
    
    This is the Python code I'm using:
    ``` python
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
    import pandas as pd
    import numpy as np
    
    multi = MultinomialNB()
    
    multi_train = pd.read_csv('multinomial.data.train', header=None)
    multi_train_labels = multi_train.iloc[:, 0]
    multi_train_features = multi_train.iloc[:, 1:]
    
    multi_model = multi.fit(multi_train_features, multi_train_labels)
    
    multi_test = pd.read_csv('multinomial.data.test', header=None)
    multi_test_labels = multi_test.iloc[:, 0]
    multi_test_features = multi_test.iloc[:, 1:]
    multi_pred = multi_model.predict(multi_test_features)
    
    print("Multinomial:\nNumber of mislabeled points out of a total %d points : %d" % (len(multi_test), (multi_test_labels != multi_pred).sum()))
    
    multi_probs = multi_model.predict_proba(multi_test_features)
    np.savetxt('multinomial.probs', multi_probs, delimiter=" ")
    
    
    
    bernoulli = BernoulliNB()
    
    bernoulli_train = pd.read_csv('bernoulli.data.train', header=None)
    bernoulli_train_labels = bernoulli_train.iloc[:, 0]
    bernoulli_train_features = bernoulli_train.iloc[:, 1:]
    
    bernoulli_model = bernoulli.fit(bernoulli_train_features, bernoulli_train_labels)
    
    bernoulli_test = pd.read_csv('bernoulli.data.test', header=None)
    bernoulli_test_labels = bernoulli_test.iloc[:, 0]
    bernoulli_test_features = bernoulli_test.iloc[:, 1:]
    bernoulli_pred = bernoulli_model.predict(bernoulli_test_features)
    
    print("Bernoulli:\nNumber of mislabeled points out of a total %d points : %d" % (len(bernoulli_test), (bernoulli_test_labels != bernoulli_pred).sum()))
    
    bernoulli_probs = bernoulli_model.predict_proba(bernoulli_test_features)
    np.savetxt('bernoulli.probs', bernoulli_probs, delimiter=" ")
    ```
    
    These are the results from the tests:
    * the Bernoulli test passes (the predictions all match scikit-learn's),
    * for the multinomial I get different results at every run: on average 20 examples out of 1000 have different predictions, and on average 85 probability pairs out of 1000 × 3 are discordant.
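    For reference, this is roughly how the two counts above can be computed (a sketch with tiny made-up arrays standing in for the two probability matrices; in practice they would be loaded from the saved `.probs` files with `np.loadtxt`):
    ``` python
    import numpy as np

    # Hypothetical stand-ins for the scikit-learn and Spark probability outputs.
    sklearn_probs = np.array([[0.7, 0.2, 0.1],
                              [0.1, 0.8, 0.1]])
    spark_probs   = np.array([[0.7, 0.2, 0.1],
                              [0.6, 0.3, 0.1]])

    # Predictions differ wherever the argmax over class probabilities differs.
    pred_mismatch = (sklearn_probs.argmax(axis=1) != spark_probs.argmax(axis=1)).sum()

    # Probability entries are "discordant" where they differ beyond a tolerance.
    prob_mismatch = (~np.isclose(sklearn_probs, spark_probs, atol=1e-6)).sum()

    print(pred_mismatch, prob_mismatch)  # 1 2
    ```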
    
    The strangest thing is that I get different values at every run! For example, in one run 16 examples diverge and in another 24 (and they are not even the *same* examples). I am sure I'm using the right data and that it's in the right order (otherwise the fact that the Bernoulli test passes would be even stranger).
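    To rule out nondeterminism on the scikit-learn side, a quick sanity check (a sketch with synthetic data) is to fit the same model twice and confirm the outputs are identical; `MultinomialNB` has no random component in its fit, so any run-to-run variation must come from elsewhere:
    ``` python
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    rng = np.random.RandomState(0)
    X = rng.randint(0, 10, size=(100, 3))  # synthetic count features
    y = rng.randint(0, 3, size=100)        # three classes, as in the test data

    p1 = MultinomialNB().fit(X, y).predict_proba(X)
    p2 = MultinomialNB().fit(X, y).predict_proba(X)

    # Two independent fits on the same data produce exactly the same probabilities.
    print(np.array_equal(p1, p2))  # True
    ```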

