zhengruifeng commented on a change in pull request #26413: 
[SPARK-16872][ML][PYSPARK] Impl Gaussian Naive Bayes Classifier
URL: https://github.com/apache/spark/pull/26413#discussion_r344431951
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala
 ##########
 @@ -248,19 +344,24 @@ object NaiveBayes extends DefaultParamsReadable[NaiveBayes] {
 
 /**
  * Model produced by [[NaiveBayes]]
- * @param pi log of class priors, whose dimension is C (number of classes)
+ *
+ * @param pi    log of class priors, whose dimension is C (number of classes)
  * @param theta log of class conditional probabilities, whose dimension is C (number of classes)
  *              by D (number of features)
+ * @param sigma variance of each feature, whose dimension is C (number of classes)
+ *              by D (number of features). This matrix is only available when modelType
+ *              is set to Gaussian.
  */
 @Since("1.5.0")
 class NaiveBayesModel private[ml] (
     @Since("1.5.0") override val uid: String,
     @Since("2.0.0") val pi: Vector,
-    @Since("2.0.0") val theta: Matrix)
+    @Since("2.0.0") val theta: Matrix,
+    @Since("3.0.0") val sigma: Matrix)
 
 Review comment:
   1. I tried making a subclass `GaussianNaiveBayesModel`, but I think it would involve too much complexity in both usage and implementation: we would have to explicitly cast to the subclass `GaussianNaiveBayesModel` whenever we need the sigma matrix. I am also not sure whether we would need to define a new `object GaussianNaiveBayesModel extends MLReadable[GaussianNaiveBayesModel]` and implement the write/read methods in it.
   ```scala
   val model: NaiveBayesModel = new NaiveBayes().setModelType("gaussian").fit(df)
   val sigma = model.asInstanceOf[GaussianNaiveBayesModel].sigma
   ```
   
   2. It would be fine to use `Option[Matrix]` here; however, it adds some complexity for PySpark. I would have to define a helper method on the Scala side:
   ```scala
     // helper method for PySpark, since Python has no Option type.
     private[spark] def pySigma: Matrix = sigma.get
   ```
   and on the Python side:
   ```python
       @property
       @since("3.0.0")
       def sigma(self):
           """
           variance of each feature.
           """
           if self.getModelType() == "gaussian":
               return self._call_java("pySigma")
           else:
               return None
   ```
   This is because it seems that Scala's `Option` type cannot be converted to a Python object automatically.
   
   3. Otherwise, we could create an empty sigma matrix for the Multinomial and Bernoulli model types.
   
   4. Or just continue to use a `null` sigma.
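   To make options 3 and 4 concrete, here is a hypothetical Python sketch (not from the PR; the function names are made up and NumPy arrays stand in for `Matrix`) of what callers would see in each case:

   ```python
   import numpy as np

   def sigma_option3(model_type, num_classes, num_features):
       # Option 3 sketch: always return a matrix; non-Gaussian model
       # types get an empty 0 x 0 placeholder instead of null/None.
       if model_type == "gaussian":
           # In the real model this would be the fitted C x D variance matrix.
           return np.zeros((num_classes, num_features))
       return np.zeros((0, 0))

   def sigma_option4(model_type, num_classes, num_features):
       # Option 4 sketch: keep returning None (the Python analogue of a
       # null sigma) for non-Gaussian model types.
       if model_type == "gaussian":
           return np.zeros((num_classes, num_features))
       return None
   ```

   With option 3, callers can always inspect `sigma.shape` without a null check; with option 4, callers must check for `None`/`null` first.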
   
   Among the above four approaches, I prefer 3 and 4. What do you think?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
