[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228
 ] 

RJ Nowling commented on SPARK-4894:
-----------------------------------

[~josephkb], after some thought, I've come around and think your idea of one NB 
class with a Factor type parameter may be the more maintainable choice, as well 
as offering some novel functionality.  But there seems to be a lot to figure 
out (we should check the decision tree implementation, for example) and I 
don't want to hold up what should be a relatively simple change to support 
Bernoulli NB.  What do you think?

Comments about refactoring:
(1) how often is NB used with continuous values?  I see that sklearn supports 
Gaussian NB but is this used in practice?  My understanding is that NB is 
generally used for text classification with counts or binary values, possibly 
weighted by TF-IDF.   We should probably email the users and dev lists to get 
user feedback.  If no one is asking for it, we should shelve it and focus on 
other things.

(2) after some more reflection, I can see a few more benefits to your 
suggestions of feature types (e.g., categorical, discrete counts, continuous, 
binary).  If we created corresponding FeatureLikelihood types (e.g., 
Bernoulli, Multinomial, Gaussian), it would promote composition, which 
would be easier to test, debug, and maintain than multiple NB subclasses as in 
sklearn.  Additionally, if the user can define a type for each feature, then 
users can mix and match likelihood types as well.  Most NB implementations 
treat all features the same -- what if we had a model that allowed heterogeneous 
features?  If it works well in NB, it could be extended to other parts of 
MLlib.  (There is likely some overlap with decision trees since they support 
multiple feature types, so we might want to see if there is anything there we 
can reuse.)  At the API level, we could provide a basic API which takes 
{noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity 
isn't compromised and provide a more advanced API for power users.
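
To make the composition idea in (2) concrete, here is a rough sketch in plain 
Scala (no Spark dependency) of what per-feature likelihoods could look like.  
Everything in it is hypothetical: FeatureLikelihood is only the name floated 
above, and ClassModel, MixedNaiveBayesModel, and the example parameters are 
made up for illustration; none of this is existing MLlib API.  It only covers 
prediction from already-estimated parameters, not training.

```scala
object MixedNaiveBayesSketch {

  // Hypothetical per-feature likelihood; not an existing MLlib type.
  sealed trait FeatureLikelihood {
    def logProb(x: Double): Double
  }

  // Binary feature: p is the (already smoothed) P(x = 1 | class), 0 < p < 1.
  final case class Bernoulli(p: Double) extends FeatureLikelihood {
    def logProb(x: Double): Double =
      if (x != 0.0) math.log(p) else math.log(1.0 - p)
  }

  // Count feature: contributes x * log(theta), i.e. the multinomial term
  // up to a constant that does not depend on the class.
  final case class Multinomial(theta: Double) extends FeatureLikelihood {
    def logProb(x: Double): Double = x * math.log(theta)
  }

  // Continuous feature with a class-conditional mean and variance.
  final case class Gaussian(mean: Double, variance: Double) extends FeatureLikelihood {
    def logProb(x: Double): Double = {
      val d = x - mean
      -0.5 * math.log(2.0 * math.Pi * variance) - d * d / (2.0 * variance)
    }
  }

  // One class: a log prior plus one likelihood per feature position.
  final case class ClassModel(logPrior: Double, likelihoods: IndexedSeq[FeatureLikelihood]) {
    def logJoint(features: IndexedSeq[Double]): Double = {
      require(features.length == likelihoods.length, "one likelihood per feature")
      logPrior + features.zip(likelihoods).map { case (x, l) => l.logProb(x) }.sum
    }
  }

  // The NB model itself is just an argmax over the composed class models.
  final case class MixedNaiveBayesModel(classes: Map[Double, ClassModel]) {
    def predict(features: IndexedSeq[Double]): Double =
      classes.maxBy { case (_, m) => m.logJoint(features) }._1
  }

  // Made-up parameters: feature 0 is binary, feature 1 is continuous.
  val example: MixedNaiveBayesModel = MixedNaiveBayesModel(Map(
    0.0 -> ClassModel(math.log(0.5), Vector(Bernoulli(0.2), Gaussian(0.0, 1.0))),
    1.0 -> ClassModel(math.log(0.5), Vector(Bernoulli(0.8), Gaussian(5.0, 1.0)))
  ))
}
```

With this shape, each likelihood can be unit tested on its own, and the 
"basic" API would presumably just fill in the same likelihood type for every 
feature.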


> Add Bernoulli-variant of Naive Bayes
> ------------------------------------
>
>                 Key: SPARK-4894
>                 URL: https://issues.apache.org/jira/browse/SPARK-4894
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.



