zhengruifeng commented on a change in pull request #26413: [SPARK-16872][ML][PYSPARK] Impl Gaussian Naive Bayes Classifier URL: https://github.com/apache/spark/pull/26413#discussion_r344431951
########## File path: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ########## @@ -248,19 +344,24 @@ object NaiveBayes extends DefaultParamsReadable[NaiveBayes] { /** * Model produced by [[NaiveBayes]] - * @param pi log of class priors, whose dimension is C (number of classes) + * + * @param pi log of class priors, whose dimension is C (number of classes) * @param theta log of class conditional probabilities, whose dimension is C (number of classes) * by D (number of features) + * @param sigma variance of each feature, whose dimension is C (number of classes) + * by D (number of features). This matrix is only available when modelType + * is set Gaussian. */ @Since("1.5.0") class NaiveBayesModel private[ml] ( @Since("1.5.0") override val uid: String, @Since("2.0.0") val pi: Vector, - @Since("2.0.0") val theta: Matrix) + @Since("2.0.0") val theta: Matrix, + @Since("3.0.0") val sigma: Matrix) Review comment: 1, I have try make a subclass `GaussianNaiveBayesModel`, I think it will involve too much complexity in usage and impl: we have to explictly assign the subclass `GaussianNaiveBayesModel` in some way if we need sigma matrix. I am not sure whether to define a new object `object GaussianNaiveBayesModel extends MLReadable[GaussianNaiveBayesModel]` and impl write/read method in it. ```scala val model: NaiveBayesModel = new NaiveBayes().setModelType("gaussian").fit(df) val sigma = model.asInstanceOf[GaussianNaiveBayesModel].sigma ``` 2, It is ok to use `Option[Matrix]` here, however it include a little complexity for pyspark: I have to define a helper function in the scala side: ```scala // helper function for pyspark, since python do not have option type. private[spark] def pySigma: Matrix = sigma.get ``` and in the py side ```python @property @since("3.0.0") def sigma(self): """ variance of each feature. """ if self.getModelType() == "gaussian": return self._call_java("pySigma") else: return None ``` That is because it seems that scala's `Option` type can not be converted to a python object automatically. 3, otherwise we may create an empty matrix for Multinomial & Bernoulli. 4, just contine use `null` sigma. Among above 4 approaches, I prefer to 3&4. How do you think about it? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org