GitHub user acidghost commented on the pull request:
https://github.com/apache/spark/pull/6761#issuecomment-115555334
I found that e1071 uses a Gaussian distribution ([page
34](http://cran.r-project.org/web/packages/e1071/e1071.pdf)), so I wouldn't use
the results from that package.
The MLlib prediction tests (probabilities sum to one, and more than 80% of
predictions are correct) both pass for the Bernoulli and Multinomial models.
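For reference, here is a minimal sketch of what those two checks amount to. It is plain Python rather than the actual MLlib test code; the function names and the sample data are hypothetical and only illustrate the idea.

```python
import math


def rows_sum_to_one(probabilities, tol=1e-9):
    """Return True if every row of class probabilities sums to 1 within tol."""
    return all(math.isclose(sum(row), 1.0, abs_tol=tol) for row in probabilities)


def accuracy(probabilities, labels):
    """Fraction of rows whose most probable class equals the true label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
    correct = sum(1 for p, y in zip(predicted, labels) if p == y)
    return correct / len(labels)


# Hypothetical predicted class probabilities (one row per data point)
# and the corresponding true labels.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
]
labels = [0, 1, 2, 0, 2, 2]

sums_ok = rows_sum_to_one(probs)
accuracy_ok = accuracy(probs, labels) > 0.8
```

Note that both checks can pass while the individual probabilities are still off, which is why the comparison against another library matters.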
Comparing the scikit-learn and MLlib probabilities, I get a stable result (all
matches) only with the Bernoulli model. With the Multinomial model I get
different results on every run. If I could use another library to compute the
probabilities, I would compare those with the MLlib ones, as you suggest. Do you
know of any that has both Bernoulli and Multinomial models?
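Failing a second library, one fallback could be a small hand-written reference computation. The sketch below assumes the standard multinomial naive Bayes posterior (log prior plus count-weighted log feature probabilities, normalized with log-sum-exp). The function and parameter names are hypothetical and are not MLlib's or scikit-learn's API.

```python
import math


def multinomial_posterior(log_prior, log_theta, x):
    """Class posteriors for one count vector x.

    log_prior: log P(class), one entry per class (assumed already fitted).
    log_theta: log P(feature | class), one row per class (assumed already fitted).
    """
    # Unnormalized log posterior for each class.
    scores = [
        lp + sum(xi * lt for xi, lt in zip(x, row))
        for lp, row in zip(log_prior, log_theta)
    ]
    # Log-sum-exp normalization, shifted by the max for numerical stability.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_norm) for s in scores]


def close_enough(a, b, tol=1e-6):
    """Element-wise comparison of two probability vectors within tol."""
    return len(a) == len(b) and all(abs(p - q) <= tol for p, q in zip(a, b))


# Hypothetical two-class, two-feature model.
log_prior = [math.log(0.5), math.log(0.5)]
log_theta = [
    [math.log(0.8), math.log(0.2)],
    [math.log(0.3), math.log(0.7)],
]
posterior = multinomial_posterior(log_prior, log_theta, [2, 1])
```

Comparing with a tolerance, as in `close_enough`, avoids flagging differences that come only from floating-point rounding.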
Anyway, it is strange that only the Multinomial results are wrong. Might it be
that the data generation function for Multinomial data introduces more
randomness? Or is it the prediction algorithm?
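One way to tell the two apart might be to fix the seed of the data generator, so that both libraries see the same data on every run. Below is a minimal sketch using only the Python standard library. It is not the generator used in the Spark tests; `generate_multinomial_point` and `generate_dataset` are hypothetical names.

```python
import random


def generate_multinomial_point(rng, theta, n_trials=10):
    """Draw n_trials samples from the categorical distribution theta
    and return the per-feature counts."""
    counts = [0] * len(theta)
    for idx in rng.choices(range(len(theta)), weights=theta, k=n_trials):
        counts[idx] += 1
    return counts


def generate_dataset(seed, theta, n_points=5):
    """Generate n_points count vectors from a generator seeded with seed."""
    rng = random.Random(seed)
    return [generate_multinomial_point(rng, theta) for _ in range(n_points)]


theta = [0.1, 0.4, 0.3, 0.2]  # hypothetical feature probabilities for one class

# The same seed reproduces the same dataset on every run.
first_run = generate_dataset(42, theta)
second_run = generate_dataset(42, theta)
```

If the mismatch persists with identical, seeded input, the data generation can be ruled out and the prediction code becomes the more likely cause.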