[ https://issues.apache.org/jira/browse/SPARK-17835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yanbo Liang resolved SPARK-17835.
---------------------------------
       Resolution: Fixed
         Assignee: Yanbo Liang
    Fix Version/s: 2.1.0

> Optimize NaiveBayes mllib wrapper to eliminate extra pass on data
> -----------------------------------------------------------------
>
>                 Key: SPARK-17835
>                 URL: https://issues.apache.org/jira/browse/SPARK-17835
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>             Fix For: 2.1.0
>
>
> SPARK-14077 copied the {{NaiveBayes}} implementation from mllib to ml and
> left mllib as a wrapper. However, mllib and ml differ in how they handle
> {{labels}}:
> * mllib allows input labels such as {-1, +1}, whereas ml assumes the input
>   labels are in the range [0, numClasses).
> * mllib {{NaiveBayesModel}} exposes {{labels}}, but ml does not, because of
>   the assumption mentioned above.
> During the copy in SPARK-14077, we used {{val labels =
> data.map(_.label).distinct().collect().sorted}} to get the distinct labels
> first, and then encoded the labels for training. This involves an extra
> Spark job compared with the original implementation. Since {{NaiveBayes}}
> does only a one-pass aggregation during training, adding another pass is
> inefficient. We can get the labels in a single pass along with {{NaiveBayes}}
> training and send them to the MLlib side.
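
To illustrate the idea in the description, here is a minimal sketch in plain Scala collections (no Spark dependency) of how the distinct labels can fall out of the same per-label aggregation that training already performs. The names {{Point}}, {{Stats}} and {{SinglePassLabels}} are hypothetical and are not taken from the Spark code base; the actual change in Spark may be structured differently.

```scala
// Hypothetical stand-in for a labeled training point (not Spark's LabeledPoint).
final case class Point(label: Double, features: Array[Double])

object SinglePassLabels {
  // Per-label running statistics: point count and element-wise feature sums.
  final case class Stats(count: Long, featureSums: Array[Double])

  // One traversal of the data: group by label and accumulate statistics.
  // The distinct labels are simply the keys of the result, so no separate
  // distinct() pass over the data is needed. Labels need not lie in
  // [0, numClasses); values such as -1.0 and +1.0 work as well.
  def aggregate(data: Seq[Point]): Seq[(Double, Stats)] = {
    val byLabel = data.foldLeft(Map.empty[Double, Stats]) { (acc, p) =>
      val updated = acc.get(p.label) match {
        case Some(s) =>
          Stats(s.count + 1, s.featureSums.zip(p.features).map { case (a, b) => a + b })
        case None =>
          Stats(1L, p.features.clone())
      }
      acc.updated(p.label, updated)
    }
    // Sort by label so that the position in the result can serve as the
    // encoded class index.
    byLabel.toSeq.sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Point(1.0, Array(1.0, 0.0)),
      Point(-1.0, Array(0.0, 2.0)),
      Point(1.0, Array(3.0, 1.0))
    )
    val aggregated = aggregate(data)
    val labels = aggregated.map(_._1) // sorted distinct labels, obtained for free
    println(labels.mkString(", "))
  }
}
```

In a distributed setting the same shape would presumably be expressed as a keyed aggregation over (label, features) pairs, with the sorted keys of the collected result serving as the labels.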