Github user yinxusen commented on the pull request:
https://github.com/apache/spark/pull/9057#issuecomment-156391326
@selvinsource @mengxr I modified your [PMML export
validation code](https://github.com/yinxusen/spark-pmml-exporter-validator/blob/logistic_regression_multi_class/src/main/java/org/selvinsource/spark_pmml_exporter_validator/SparkPMMLExporterValidator.java#L144).
My current code passes both the multinomial and Bernoulli cases. However, I am
confused by how PMML defines the multinomial case.
According to the [PMML Naive Bayes
Guide](http://dmg.org/pmml/v4-2-1/NaiveBayes.html), there are
two kinds of features: categorical and continuous. Since we use
`LabeledPoint` as our input in the multinomial case, I believe that we
should treat each feature as a continuous input. Although continuous features
can be discretized into categorical ones, we cannot do that here because
`NaiveBayesModel` does not carry enough information to estimate the range of
each input feature.
In the continuous setting, PMML for Naive Bayes provides two
distributions: the Gaussian distribution and the Poisson distribution. But
neither Gaussian nor Poisson fits the multinomial case, because their scoring
procedure differs from the one our multinomial model uses.
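
To make the mismatch concrete, here is a minimal sketch of the two scoring rules side by side. The function and parameter names are hypothetical and not taken from Spark or PMML; the multinomial scorer assumes the model holds log priors and log conditional probabilities, which is my understanding of what `NaiveBayesModel` stores in `pi` and `theta`.

```python
import math

def multinomial_log_score(log_prior, log_theta, x):
    # Multinomial naive Bayes: log prior plus a term that is LINEAR in x.
    # log_theta[i] is the log conditional probability of feature i for this class.
    return log_prior + sum(xi * lt for xi, lt in zip(x, log_theta))

def gaussian_log_score(log_prior, means, variances, x):
    # Gaussian naive Bayes: log prior plus a term that is QUADRATIC in x.
    total = log_prior
    for xi, mu, var in zip(x, means, variances):
        total += -0.5 * math.log(2.0 * math.pi * var) - (xi - mu) ** 2 / (2.0 * var)
    return total
```

Doubling every feature value doubles the likelihood part of the multinomial score, but not of the Gaussian one, so in general the two cannot be expected to agree on arbitrary inputs.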
Currently, I use a Gaussian distribution for continuous features, with
`1.0` as a pseudo-variance. But I am not sure this is correct.
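
One way to reason about the unit pseudo-variance is to expand the Gaussian log-density. The sketch below is only an illustration: `mu` stands for a hypothetical per-class, per-feature mean, and nothing here says which value the exporter should write for it.

```python
import math

def unit_variance_gaussian_log_density(x, mu):
    # log N(x | mu, 1.0)
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mu) ** 2

def expanded_form(x, mu):
    # The same quantity, split into two parts.
    # This part is identical for every class, so it cancels when classes are compared.
    class_independent = -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x
    # This part is linear in x (weight mu) plus a bias of -mu^2 / 2.
    class_dependent = mu * x - 0.5 * mu * mu
    return class_independent + class_dependent
```

With a shared variance of `1.0`, the class-dependent part is linear in `x`, like the multinomial score, but it carries an extra `-mu^2 / 2` offset per feature that the multinomial score does not have. Whether that offset matters depends on which means and priors are exported, so this does not by itself settle the correctness question.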