Yanbo Liang created SPARK-17835:
-----------------------------------
Summary: Optimize NaiveBayes mllib wrapper to eliminate extra pass
on data
Key: SPARK-17835
URL: https://issues.apache.org/jira/browse/SPARK-17835
Project: Spark
Issue Type: Bug
Components: ML, MLlib
Reporter: Yanbo Liang
SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and left
mllib as a wrapper. However, there are some differences in how mllib and ml
handle {{labels}}:
* mllib allows input labels such as {-1, +1}, whereas ml assumes the input
labels are in the range [0, numClasses).
* mllib {{NaiveBayesModel}} exposes {{labels}}, but ml does not, because of the
assumption mentioned above.
During the copy in SPARK-14077, we used {{val labels =
data.map(_.label).distinct().collect().sorted}} to get the distinct labels
first, and then fed them to training. This involves an extra Spark job
compared with the original implementation. Since {{NaiveBayes}} only does one
aggregation during training, adding a second pass over the data is
inefficient. We can get the labels in a single pass along with {{NaiveBayes}}
training and send them to the MLlib side.
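
A minimal sketch of the idea, using plain Scala collections instead of Spark so
that it runs on its own. {{Point}}, {{SinglePassLabels}} and {{aggregate}} are
hypothetical names for illustration, not the actual Spark code. It assumes the
training aggregation is keyed by label, so the distinct labels can be read off
the aggregated keys with no separate {{distinct()}} job:

```scala
// Hypothetical stand-in for a labeled point; not Spark's LabeledPoint.
final case class Point(label: Double, features: Array[Double])

object SinglePassLabels {
  // One fold over the data builds, per label, (example count, feature sums).
  // Assumes all feature arrays have the same length.
  def aggregate(data: Seq[Point]): Seq[(Double, (Long, Array[Double]))] = {
    val acc = data.foldLeft(Map.empty[Double, (Long, Array[Double])]) { (m, p) =>
      val updated = m.get(p.label) match {
        case Some((n, sums)) =>
          (n + 1L, sums.zip(p.features).map { case (a, b) => a + b })
        case None =>
          (1L, p.features.clone())
      }
      m.updated(p.label, updated)
    }
    // Sort by label so the result has a deterministic order.
    acc.toSeq.sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Point(1.0, Array(1.0, 0.0)),
      Point(-1.0, Array(0.0, 2.0)),
      Point(1.0, Array(3.0, 1.0))
    )
    val aggregated = aggregate(data)
    // The sorted distinct labels are just the keys of the aggregation,
    // so labels such as {-1, +1} are recovered without a second pass.
    val labels = aggregated.map(_._1).toArray
    println(labels.mkString(", "))
  }
}
```

In Spark the fold would presumably be a by-key aggregation over the training
data, with the keys of the collected result passed to the MLlib side as
{{labels}}.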