Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/6761#issuecomment-115582804
  
    OK, maybe this is getting too far down a rabbit hole: hard-coding results 
from an implementation that we're not 100% sure says what we want. Maybe it's 
simpler to compute directly in the test what the probabilities should be, 
given the model.
    
    For example, in the case of Multinomial, you have a vector pi of C 
values and a matrix theta with C rows and D columns. The score for class 0 is 
pi(0) plus the dot product of row 0 of theta with your data. Taking exp of 
that sum gives the unnormalized probability for class 0. Then the unnormalized 
probabilities over all classes are normalized to sum to 1. Those results ought 
to be very close to the output of the model, since it's what the model 
computes almost word for word!
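
    A minimal sketch of that Multinomial computation, in plain Python for 
brevity. The function name and the toy inputs are made up for illustration and 
are not Spark's API; it assumes pi and theta hold log values, as the 
description above implies.

```python
import math

def multinomial_posterior(pi, theta, x):
    """Naive posterior from the definition (hypothetical helper).

    pi:    C log priors (assumed to be in log space)
    theta: C rows x D columns of log conditional probabilities (assumed)
    x:     D feature values (e.g. counts)
    """
    # Per-class score: pi(c) + dot(theta(c), x)
    scores = [p + sum(t * v for t, v in zip(row, x))
              for p, row in zip(pi, theta)]
    # Exponentiate to get unnormalized probabilities, then normalize to sum to 1
    unnormalized = [math.exp(s) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Toy model: 2 classes, 2 features (values invented for the example)
pi = [math.log(0.5), math.log(0.5)]
theta = [[math.log(0.8), math.log(0.2)],
         [math.log(0.2), math.log(0.8)]]
probs = multinomial_posterior(pi, theta, [2, 0])
```

    A test could then compare `probs` with the model's predicted 
probabilities using a small tolerance rather than exact equality.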
    
    In that sense it almost feels redundant, but it is coding the definition 
of the prediction in the test, which is appropriate. Later, if the 
implementation changes, the test still checks against the naive, 
straightforward computation.
    
    For Bernoulli, it's similar, except that you start with pi(0) and then, 
for each feature, add the element of theta where the input is 1, but 
log(1-exp(theta)) where the input is 0. 
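
    A sketch of the Bernoulli case along the same lines, again in plain 
Python with a made-up function name and toy inputs (not Spark's API), assuming 
pi and theta hold log values and the input is a 0/1 vector:

```python
import math

def bernoulli_posterior(pi, theta, x):
    """Naive Bernoulli posterior from the definition (hypothetical helper).

    pi:    C log priors (assumed to be in log space)
    theta: C rows x D columns of log P(feature = 1 | class) (assumed)
    x:     D binary features, each 0 or 1
    """
    scores = []
    for log_prior, row in zip(pi, theta):
        score = log_prior
        for log_p, bit in zip(row, x):
            if bit == 1:
                score += log_p
            else:
                # log(1 - exp(theta)), written with log1p
                score += math.log1p(-math.exp(log_p))
        scores.append(score)
    # Exponentiate, then normalize over all classes
    unnormalized = [math.exp(s) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Toy model: 2 classes, 2 binary features (values invented for the example)
pi = [math.log(0.5), math.log(0.5)]
theta = [[math.log(0.9), math.log(0.2)],
         [math.log(0.3), math.log(0.6)]]
probs = bernoulli_posterior(pi, theta, [1, 0])
```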
    
    I realize that's not a great description, so I can help write this part 
if that would be useful.
    
    Also, I noticed a potential tiny inaccuracy in how the Naive Bayes 
Bernoulli computation works. `math.log(1.0 - math.exp(value))` becomes 
inaccurate when value is very negative, because `1.0 - math.exp(value)` rounds 
to a number very close to (or exactly) 1.0. `math.log1p(-math.exp(value))` is 
more accurate in this case. It could matter at some level if we're asserting 
about the exact probability, and probabilities are often tiny in the output.
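
    A quick way to see the difference, sketched in Python, whose `math.log` 
and `math.log1p` behave like the calls above for this purpose. The value -40 
is an arbitrary choice for illustration:

```python
import math

value = -40.0  # a very negative log-probability, picked for illustration

# exp(-40) is far below the spacing of doubles near 1.0, so the subtraction
# rounds to exactly 1.0 and the log loses the term entirely.
naive = math.log(1.0 - math.exp(value))

# log1p(y) computes log(1 + y) accurately for small y, so the tiny term survives.
accurate = math.log1p(-math.exp(value))

print(naive, accurate)
```

    Here `naive` comes out as zero while `accurate` is a small negative 
number of about the same magnitude as `math.exp(value)`.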

