[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228 ]
RJ Nowling commented on SPARK-4894:
-----------------------------------

[~josephkb], after some thought, I've come around and think your idea of a single NB class with a Factor type parameter may be the more maintainable choice, as well as offering some novel functionality. But there seems to be a lot to figure out (we should check the decision tree implementation, for example), and I don't want to hold up what should be a relatively simple change to support Bernoulli NB. What do you think?

Comments about refactoring:

(1) How often is NB used with continuous values? I see that sklearn supports Gaussian NB, but is this used in practice? My understanding is that NB is generally used for text classification with counts or binary values, possibly weighted by TF-IDF. We should probably email the users and dev lists to get user feedback. If no one is asking for it, we should shelve it and focus on other things.

(2) After some more reflection, I can see a few more benefits to your suggestion of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which would be easier to test, debug, and maintain than multiple NB subclasses as in sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same. What if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.)

At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat}, like the current API, so that simplicity isn't compromised, and provide a more advanced API for power users.
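To make the composition idea in (2) concrete, here is a rough sketch in plain Scala of what per-feature likelihood types might look like. Everything in it is an assumption for illustration: the names FeatureLikelihood, Bernoulli, Gaussian, ClassModel, and MixedNaiveBayes are hypothetical and do not exist in MLlib, parameter fitting is left out entirely, and it works on local collections rather than RDDs.

```scala
// Hypothetical sketch only: none of these types exist in MLlib.

/** Class-conditional likelihood of a single feature. */
sealed trait FeatureLikelihood {
  /** Returns log P(x | class) for one feature value. */
  def logLikelihood(x: Double): Double
}

/** Binary feature with P(x = 1 | class) = p; any non-zero value counts as 1. */
final case class Bernoulli(p: Double) extends FeatureLikelihood {
  require(p > 0.0 && p < 1.0, "p must lie strictly between 0 and 1")
  def logLikelihood(x: Double): Double =
    if (x != 0.0) math.log(p) else math.log(1.0 - p)
}

/** Continuous feature with a class-conditional mean and variance. */
final case class Gaussian(mean: Double, variance: Double) extends FeatureLikelihood {
  require(variance > 0.0, "variance must be positive")
  def logLikelihood(x: Double): Double = {
    val d = x - mean
    -0.5 * (math.log(2.0 * math.Pi * variance) + d * d / variance)
  }
}

/** Parameters for one class: a log prior plus one likelihood per feature. */
final case class ClassModel(logPrior: Double, likelihoods: IndexedSeq[FeatureLikelihood])

/** Naive Bayes model whose features may each use a different likelihood type. */
final case class MixedNaiveBayes(classes: Map[Double, ClassModel]) {
  require(classes.nonEmpty, "at least one class is required")

  /** log P(class) + sum over features of log P(x_j | class). */
  def logJoint(label: Double, features: IndexedSeq[Double]): Double = {
    val m = classes(label)
    require(features.length == m.likelihoods.length, "feature count mismatch")
    m.logPrior + m.likelihoods.zip(features).map { case (l, x) => l.logLikelihood(x) }.sum
  }

  /** Returns the label with the highest joint log-probability. */
  def predict(features: IndexedSeq[Double]): Double =
    classes.keys.reduce { (a, b) =>
      if (logJoint(a, features) >= logJoint(b, features)) a else b
    }
}

object Example {
  // Feature 0 is binary, feature 1 is continuous; the parameters are made up.
  val model: MixedNaiveBayes = MixedNaiveBayes(Map(
    0.0 -> ClassModel(math.log(0.5), Vector(Bernoulli(0.1), Gaussian(0.0, 1.0))),
    1.0 -> ClassModel(math.log(0.5), Vector(Bernoulli(0.9), Gaussian(5.0, 1.0)))
  ))
}
```

With these made-up parameters, Example.model.predict(Vector(1.0, 4.5)) selects label 1.0 and Example.model.predict(Vector(0.0, 0.2)) selects label 0.0. A basic API could build the same structure internally with one likelihood type for every feature, so callers who pass plain vectors would never need to see these types.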

> Add Bernoulli-variant of Naive Bayes
> ------------------------------------
>
>                 Key: SPARK-4894
>                 URL: https://issues.apache.org/jira/browse/SPARK-4894
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli
> version of Naive Bayes is more useful for situations where the features are
> binary values.