Rahul K Bhojwani created SPARK-2433:
---------------------------------------

             Summary: The MLlib implementation of Naive Bayes in Spark 0.9.1 
has an implementation bug.
                 Key: SPARK-2433
                 URL: https://issues.apache.org/jira/browse/SPARK-2433
             Project: Spark
          Issue Type: Bug
          Components: MLlib, PySpark
    Affects Versions: 0.9.1
         Environment: Any 
            Reporter: Rahul K Bhojwani


I don't have much experience reporting issues; this is my first time. If 
something is not clear, please feel free to contact me (details given below).

The bug is in the PySpark MLlib library.
Path: \spark-0.9.1\python\pyspark\mllib\classification.py

Class: NaiveBayesModel

Method: predict

Earlier Implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x, self.theta))
        

New Implementation:
No:1
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))

No:2
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x,self.theta.T))

Explanation:
No:1 is correct, in my opinion. I am not sure about No:2.

Error one:
The matrix self.theta has dimensions [n_classes, n_features], 
while the vector x has dimensions [1, n_features].

Taking dot(x, self.theta) does not work, since it multiplies 
[1, n_features] by [n_classes, n_features], and it always raises:
"ValueError: matrices are not aligned"
In the commented example in classification.py, n_classes = n_features = 2, 
which is why no error shows up there.

Both implementation No:1 and implementation No:2 take care of this.
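
To make the shape issue concrete, here is a minimal standalone numpy sketch. 
The values of pi, theta and x are made up for illustration (they are not taken 
from MLlib); only the shapes follow the description above. It shows the earlier 
dot failing while both proposed variants run:

import numpy
from numpy import dot

# Toy model (made-up values, shapes as described above):
# pi is [n_classes] of log class priors,
# theta is [n_classes, n_features] of log conditional probabilities,
# x is [n_features] of feature counts.
pi = numpy.log(numpy.array([0.5, 0.3, 0.2]))
theta = numpy.log(numpy.array([[0.1, 0.2, 0.3, 0.4],
                               [0.4, 0.3, 0.2, 0.1],
                               [0.25, 0.25, 0.25, 0.25]]))
x = numpy.array([1.0, 0.0, 2.0, 1.0])

# Earlier implementation: dot(x, theta) multiplies [n_features] by
# [n_classes, n_features] and raises ValueError (shapes not aligned).
try:
    numpy.argmax(pi + dot(x, theta))
except ValueError as e:
    print("earlier predict fails:", e)

# No:1 -- dot(exp(theta), x) is [n_classes, n_features] x [n_features]
print(numpy.argmax(pi + numpy.log(dot(numpy.exp(theta), x))))

# No:2 -- dot(x, theta.T) is [n_features] x [n_features, n_classes]
print(numpy.argmax(pi + dot(x, theta.T)))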

Error 2:
The basic Naive Bayes rule is:
P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * 
count_feature_n * P(feature_n | class_n) * P(class_n) / (the constant P(sample))

and the prediction is the class with the maximum value.
That is what implementation No:1 is doing.

In implementation No:2, it is basically the class with the maximum value of:
exp(count_feature_1) * P(feature_1 | class_n) * ... * 
exp(count_feature_n) * P(feature_n | class_n) * P(class_n)

I don't know whether it gives exactly the same result.
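
To illustrate that uncertainty, here is another small standalone sketch, again 
with made-up toy numbers (not from MLlib). No:1 adds pi to the log of a weighted 
sum of probabilities, while No:2 adds pi to a weighted sum of log probabilities, 
and with these particular toy values the two variants even pick different classes:

import numpy
from numpy import dot

# Toy model (hypothetical numbers): two classes, two features.
pi = numpy.log(numpy.array([0.5, 0.5]))        # log class priors
theta = numpy.log(numpy.array([[0.9, 0.1],     # log P(feature | class 0)
                               [0.6, 0.4]]))   # log P(feature | class 1)
x = numpy.array([3.0, 1.0])                    # feature counts

# No:1: pi_c + log( sum_f x_f * P(f | c) )
scores_1 = pi + numpy.log(dot(numpy.exp(theta), x))

# No:2: pi_c + sum_f x_f * log P(f | c)
scores_2 = pi + dot(x, theta.T)

print(scores_1, numpy.argmax(scores_1))  # picks class 0 with these numbers
print(scores_2, numpy.argmax(scores_2))  # picks class 1 with these numbers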

Thanks
Rahul Bhojwani
[email protected]


