Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/6761#issuecomment-115582804
OK, maybe hard-coding some results from an implementation that we're not 100%
sure is saying what we want is getting too far down a rabbit hole.
Maybe it's simpler to just directly compute in the test what the probabilities
should be, given the model.
For example, in the case of Multinomial, you have this vector pi of C
values (the log class priors), and this matrix theta with C rows and D columns
(the log conditional probabilities). The log of the unnormalized probability of
class 0 is the sum of pi(0) and the dot product of row 0 of theta with your
data; exponentiating that sum gives the final unnormalized probability for
class 0. Then the unnormalized probs over all classes are normalized to sum
to 1. Those results ought to be very close to the output of the model, since
it's what the model computes almost word for word!
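A minimal sketch of that Multinomial check, written in Python only for illustration (the actual test would live in the project's own test suite). The function name `multinomial_posterior` and the plain-list representation of pi, theta and the data are assumptions made for this example, not anything from the PR:

```python
import math

def multinomial_posterior(pi, theta, x):
    """Naive reference computation of class probabilities.

    pi    -- list of C log class priors (assumed layout)
    theta -- C x D nested list of log conditional probabilities (assumed layout)
    x     -- list of D feature values
    """
    unnormalized = []
    for c in range(len(pi)):
        # Log of the unnormalized probability: pi(c) + theta(c) . x
        log_prob = pi[c] + sum(t * v for t, v in zip(theta[c], x))
        # Exponentiate directly; this can underflow for very negative sums
        unnormalized.append(math.exp(log_prob))
    total = sum(unnormalized)
    # Normalize so the probabilities over all classes sum to 1
    return [u / total for u in unnormalized]
```

The test would then compare these values against the model's output within a small tolerance.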
In that sense it almost feels redundant, but it is coding the definition
of the prediction in the test, which is appropriate. Later, if the
implementation changes, the test is still checking against the naive,
straightforward computation.
For Bernoulli, it's similar, except that you start with pi(0), then add the
element of row 0 of theta for each feature where the input is 1, and
log(1-exp(theta)) for each feature where the input is 0.
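The Bernoulli case could be sketched the same way. Again this is Python for illustration only, and `bernoulli_posterior` plus the list-based inputs are hypothetical names and layouts chosen for the example:

```python
import math

def bernoulli_posterior(pi, theta, x):
    """Naive reference computation for the Bernoulli model.

    pi    -- list of C log class priors (assumed layout)
    theta -- C x D nested list of log conditional probabilities (assumed layout)
    x     -- list of D binary features, each 0 or 1
    """
    unnormalized = []
    for c in range(len(pi)):
        log_prob = pi[c]
        for t, v in zip(theta[c], x):
            if v == 1:
                # Feature present: add log P(feature | class)
                log_prob += t
            else:
                # Feature absent: add log(1 - exp(theta)), computed via log1p
                log_prob += math.log1p(-math.exp(t))
        unnormalized.append(math.exp(log_prob))
    total = sum(unnormalized)
    # Normalize so the probabilities over all classes sum to 1
    return [u / total for u in unnormalized]
```

Note that an entry of theta equal to 0 (a conditional probability of exactly 1) would make the absent-feature term undefined; this sketch does not handle that case.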
I realize that's not a great description, so I can assist with writing this
part if it would help.
Also, I noticed a potential tiny inaccuracy in how the Naive Bayes Bernoulli
computation works. `math.log(1.0 - math.exp(value))` becomes inaccurate when
value is very negative, because `math.exp(value)` is then tiny and is partly
or entirely lost when subtracted from 1.0. `math.log1p(-math.exp(value))` is
more accurate in this case. It could matter at some level if we're asserting
about the exact probability, and probabilities are often tiny in the output.
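A small illustration of the difference, in Python, where the same three `math` functions exist. The input -40.0 is just an arbitrary example value:

```python
import math

value = -40.0  # arbitrary, very negative log-probability

# exp(-40) is about 4.2e-18, far below double-precision resolution near 1.0,
# so the subtraction rounds to exactly 1.0 and the log comes out as 0.0
naive = math.log(1.0 - math.exp(value))

# log1p(y) computes log(1 + y) without forming 1 + y, so the tiny term survives
stable = math.log1p(-math.exp(value))

print(naive)   # 0.0
print(stable)  # a small negative number, approximately -exp(-40)
```

With the naive form, the contribution of this feature silently vanishes; with `log1p` it stays a small negative number as expected.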