Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/9057#issuecomment-156391326
  
    @selvinsource @mengxr I modified your [PMML export validation
code](https://github.com/yinxusen/spark-pmml-exporter-validator/blob/logistic_regression_multi_class/src/main/java/org/selvinsource/spark_pmml_exporter_validator/SparkPMMLExporterValidator.java#L144).
 My current code passes both the multinomial and Bernoulli cases. However, I am
confused by how the PMML specification handles the multinomial case.
    
    According to the [PMML Naive Bayes
guide](http://dmg.org/pmml/v4-2-1/NaiveBayes.html), there are two kinds of
features: categorical and continuous. Since we use `LabeledPoint` as our input
in the multinomial case, I believe we should treat each feature as a
continuous input. Although continuous features could be discretized into
categorical ones, we cannot do that here, because `NaiveBayesModel` does not
carry enough information to estimate the range of each input feature.
    
    In the continuous setting, PMML for Naive Bayes provides two
distributions: Gaussian and Poisson. Neither fits the multinomial case,
because their scoring procedures differ from the one our multinomial model
uses.
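
    To make the mismatch concrete, here is a minimal sketch that compares the
two scoring rules for a single class. It assumes the usual textbook forms:
the multinomial score is the log prior plus a sum of `x_i * log(theta_i)`,
and the Gaussian score is the log prior plus a sum of log densities. The
function names and the toy parameters are hypothetical and are not taken from
Spark or from the validator.

```python
import math

def multinomial_log_score(x, log_prior, log_theta):
    """Unnormalized log-posterior for one class: linear in the features x."""
    return log_prior + sum(xi * lt for xi, lt in zip(x, log_theta))

def gaussian_log_score(x, log_prior, means, variances):
    """Unnormalized log-posterior for one class: quadratic in the features x."""
    total = log_prior
    for xi, mu, var in zip(x, means, variances):
        total += -0.5 * math.log(2.0 * math.pi * var) - (xi - mu) ** 2 / (2.0 * var)
    return total

# Hypothetical toy parameters for one class with two features.
log_prior = math.log(0.5)
theta = [0.7, 0.3]
log_theta = [math.log(t) for t in theta]

x = [2.0, 1.0]
x_doubled = [4.0, 2.0]

# Feature contribution only (score minus the prior).
m1 = multinomial_log_score(x, log_prior, log_theta) - log_prior
m2 = multinomial_log_score(x_doubled, log_prior, log_theta) - log_prior

g1 = gaussian_log_score(x, log_prior, theta, [1.0, 1.0]) - log_prior
g2 = gaussian_log_score(x_doubled, log_prior, theta, [1.0, 1.0]) - log_prior

# Doubling the counts doubles the multinomial contribution, but not the Gaussian one.
print(m2 / m1, g2 / g1)
```

    The multinomial contribution scales linearly with the feature values,
while the Gaussian one penalizes squared distance from a mean, so a Gaussian
element cannot reproduce the multinomial score term by term.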
    
    Currently, I use the Gaussian distribution for continuous features, with
`1.0` as a pseudo-variance, but I am not sure this is correct.
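
    One way to reason about the unit-variance choice is sketched below. With
every variance fixed at `1.0`, the Gaussian score difference between two
classes reduces to a linear function of `x` with weights `mu_c` and bias
`log_prior_c - 0.5 * ||mu_c||^2`. The sketch assumes, purely for
illustration, that the means are set to the log conditional probabilities;
that choice and all names and numbers here are hypothetical, not necessarily
what the exporter does.

```python
import math

def unit_gaussian_score(x, log_prior, means):
    """Gaussian NB log-score for one class with every variance fixed at 1.0."""
    return log_prior + sum(
        -0.5 * math.log(2.0 * math.pi) - 0.5 * (xi - mu) ** 2
        for xi, mu in zip(x, means)
    )

def linear_score(x, bias, weights):
    """Plain linear score: bias + w . x."""
    return bias + sum(xi * w for xi, w in zip(x, weights))

# Hypothetical two-class, two-feature parameters (illustration only).
log_priors = [math.log(0.6), math.log(0.4)]
log_theta = [
    [math.log(0.7), math.log(0.3)],
    [math.log(0.2), math.log(0.8)],
]
x = [3.0, 1.0]

# Assumption: the Gaussian means are set to the log conditional probabilities.
means = log_theta

gauss_gap = (unit_gaussian_score(x, log_priors[0], means[0])
             - unit_gaussian_score(x, log_priors[1], means[1]))

# Equivalent linear form: the x^2 and normalization terms cancel across classes.
biases = [lp - 0.5 * sum(m * m for m in mu) for lp, mu in zip(log_priors, means)]
linear_gap = (linear_score(x, biases[0], means[0])
              - linear_score(x, biases[1], means[1]))

# Multinomial gap: same weights, but the bias is the log prior alone.
multinomial_gap = (linear_score(x, log_priors[0], log_theta[0])
                   - linear_score(x, log_priors[1], log_theta[1]))

# The leftover is the class-dependent offset -0.5 * ||mu_c||^2.
offset_gap = gauss_gap - multinomial_gap
print(gauss_gap, linear_gap, multinomial_gap, offset_gap)
```

    Under that assumption the two models agree on the weights and differ only
by a class-dependent offset, which can still change the predicted class when
scores are close.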


