Github user acidghost commented on the pull request:
https://github.com/apache/spark/pull/6761#issuecomment-114978814
@srowen So I wasn't able to find anything more about e1071, so I opted for
scikit-learn, which features both multinomial and Bernoulli models
([link](http://scikit-learn.org/stable/modules/naive_bayes.html)).
This is the Python code I'm using:
``` python
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
import pandas as pd
import numpy as np

# Multinomial model: label in the first column, features in the rest.
multi = MultinomialNB()
multi_train = pd.read_csv('multinomial.data.train', header=None)
multi_train_labels = multi_train.iloc[:, 0]
multi_train_features = multi_train.iloc[:, 1:]
multi_model = multi.fit(multi_train_features, multi_train_labels)

multi_test = pd.read_csv('multinomial.data.test', header=None)
multi_test_labels = multi_test.iloc[:, 0]
multi_test_features = multi_test.iloc[:, 1:]
multi_pred = multi_model.predict(multi_test_features)
print("Multinomial:\nNumber of mislabeled points out of a total %d points : %d"
      % (len(multi_test), (multi_test_labels != multi_pred).sum()))
multi_probs = multi_model.predict_proba(multi_test_features)
np.savetxt('multinomial.probs', multi_probs, delimiter=" ")

# Bernoulli model: same data layout.
bernoulli = BernoulliNB()
bernoulli_train = pd.read_csv('bernoulli.data.train', header=None)
bernoulli_train_labels = bernoulli_train.iloc[:, 0]
bernoulli_train_features = bernoulli_train.iloc[:, 1:]
bernoulli_model = bernoulli.fit(bernoulli_train_features, bernoulli_train_labels)

bernoulli_test = pd.read_csv('bernoulli.data.test', header=None)
bernoulli_test_labels = bernoulli_test.iloc[:, 0]
bernoulli_test_features = bernoulli_test.iloc[:, 1:]
bernoulli_pred = bernoulli_model.predict(bernoulli_test_features)
print("Bernoulli:\nNumber of mislabeled points out of a total %d points : %d"
      % (len(bernoulli_test), (bernoulli_test_labels != bernoulli_pred).sum()))
bernoulli_probs = bernoulli_model.predict_proba(bernoulli_test_features)
np.savetxt('bernoulli.probs', bernoulli_probs, delimiter=" ")
```
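For the comparison itself, I count entries of the two probability matrices that differ beyond a tolerance rather than demanding exact equality. This is a sketch of how I'd do it; the second filename (`spark.multinomial.probs`) is a hypothetical stand-in for wherever the Spark-side probabilities get saved:

``` python
import numpy as np

def count_discordant(file_a, file_b, atol=1e-6):
    """Count probability entries that differ beyond a tolerance."""
    a = np.loadtxt(file_a)
    b = np.loadtxt(file_b)
    # The rows must line up example-for-example for the comparison to mean anything.
    assert a.shape == b.shape, "probability matrices have different shapes"
    return int((~np.isclose(a, b, atol=atol)).sum())

# e.g. count_discordant('multinomial.probs', 'spark.multinomial.probs')
```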
These are the results from the tests:
* the Bernoulli test passes (the predictions all match scikit-learn's),
* for the multinomial model I get different results on every run. On average,
20 examples out of 1000 have different predictions, and on average 85
probability pairs out of 1000 × 3 are discordant.
The strangest thing is that I get different values on every run! For
example, in one run 16 examples diverge and in another 24 (and they are not
even the *same* examples). I am sure that I'm using the right data, and it
should be in the right order (otherwise the fact that the Bernoulli test
passes would be even stranger).
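One thing worth noting: as far as I know, scikit-learn's `MultinomialNB` is closed-form (it just counts and smooths), so with identical inputs two fits should agree exactly. A quick sanity check on synthetic count data (made up here, not the real `multinomial.data.train`), which would suggest the run-to-run variation comes from the other side of the comparison:

``` python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count features with 3 classes.
rng = np.random.RandomState(0)
X = rng.randint(0, 10, size=(100, 3))
y = rng.randint(0, 3, size=100)

# Fit twice on identical data; the predictions should be identical.
preds1 = MultinomialNB().fit(X, y).predict(X)
preds2 = MultinomialNB().fit(X, y).predict(X)
assert (preds1 == preds2).all()
```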