Yanbo Liang created SPARK-17835:
-----------------------------------

             Summary: Optimize NaiveBayes mllib wrapper to eliminate extra pass on data
                 Key: SPARK-17835
                 URL: https://issues.apache.org/jira/browse/SPARK-17835
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib
            Reporter: Yanbo Liang


SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and left 
mllib as a wrapper. However, mllib and ml differ in how they handle 
{{labels}}:
* mllib allows input labels such as {-1, +1}, whereas ml assumes the input 
labels are in the range [0, numClasses).
* mllib {{NaiveBayesModel}} exposes {{labels}}, but ml does not because of the 
assumption mentioned above.
During the copy in SPARK-14077, we used {{val labels = 
data.map(_.label).distinct().collect().sorted}} to get the distinct labels 
first and then fed them to training. This involves an extra Spark job 
compared with the original implementation. Since {{NaiveBayes}} does only one 
aggregation during training, adding another one is inefficient. We can get 
the labels in a single pass along with {{NaiveBayes}} training and send them to 
the MLlib side.
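
A minimal sketch of the idea, using plain Scala collections as a stand-in for 
an RDD so that it runs without Spark. The names {{Example}}, {{Agg}} and 
{{aggregateByLabel}} are hypothetical and are not part of the Spark API. It 
assumes training aggregates a count and feature sums per label; under that 
assumption the distinct labels are the keys of the aggregation result, so no 
separate {{distinct()}} job is needed.

```scala
object SinglePassLabels {
  // Hypothetical stand-in for a labeled training example
  // (not Spark's LabeledPoint).
  final case class Example(label: Double, features: Array[Double])

  // Per-label aggregate: number of examples and element-wise feature sums.
  final case class Agg(count: Long, featureSum: Array[Double])

  // A single traversal of the data builds the per-label aggregates.
  // The sorted keys of the result are the distinct labels.
  def aggregateByLabel(data: Seq[Example]): Seq[(Double, Agg)] = {
    val byLabel = data.foldLeft(Map.empty[Double, Agg]) { (acc, ex) =>
      val updated = acc.get(ex.label) match {
        case Some(Agg(n, sum)) =>
          Agg(n + 1, sum.zip(ex.features).map { case (a, b) => a + b })
        case None =>
          Agg(1L, ex.features.clone())
      }
      acc.updated(ex.label, updated)
    }
    byLabel.toSeq.sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Example(1.0, Array(1.0, 0.0)),
      Example(-1.0, Array(0.0, 2.0)),
      Example(1.0, Array(3.0, 1.0))
    )
    val aggregated = aggregateByLabel(data)
    // Labels fall out of the same pass that produced the aggregates.
    val labels = aggregated.map(_._1)
    println(labels.mkString(", "))
  }
}
```

On an RDD, the same shape could presumably be obtained with a by-key 
aggregation followed by {{collect()}}, reading the labels from the collected 
keys instead of running {{distinct()}} beforehand.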



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

