[
https://issues.apache.org/jira/browse/SPARK-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541766#comment-14541766
]
Villu Ruusmann commented on SPARK-7540:
---------------------------------------
There are two kinds of tests that should be considered:
1) Static checks against the PMML XSD file. These ensure that Spark's PMML
exporter initializes all required elements and attributes, that
attribute values are within valid ranges, etc. The 1.2.X versions of the
JPMML-Model library JAR file include a suitable PMML XSD resource as /pmml.xsd,
which is accessible via the method org.jpmml.model.JAXBContext#getSchema(). For
a full example, see here:
https://github.com/jpmml/jpmml-model/blob/master/pmml-model-example/src/main/java/org/jpmml/model/ValidationExample.java
2) Dynamic checks using real data (i.e., if the PMML model is executed with
input data record X, then the result equals output data record Y). The
JPMML-Evaluator library includes a utility class, org.jpmml.evaluator.BatchUtil,
that you can use to set up simple yet powerful batch validation workflows
(it scores a PMML file against a specified input CSV file and checks that the
results equal a specified output CSV file). I'm using this approach to
test against other open-source PMML producer software such as R/Rattle and
KNIME. It is likely that future versions of the JPMML-Evaluator library will
include a separate integration tests module for Spark as well.
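To make the static check in 1) concrete, here is a minimal sketch that uses only
the JDK's javax.xml.validation API. The class XsdCheckSketch, its validate
method and the miniature TOY_XSD schema are hypothetical stand-ins invented for
illustration; a real test would load the PMML XSD shipped with JPMML-Model and
validate the document written by Spark's exporter.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class XsdCheckSketch {

    // Hypothetical miniature schema (NOT the PMML XSD): one element with a
    // required attribute whose value is restricted to two allowed strings.
    static final String TOY_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='Model'>"
        + "<xs:complexType>"
        + "<xs:attribute name='functionName' use='required'>"
        + "<xs:simpleType><xs:restriction base='xs:string'>"
        + "<xs:enumeration value='regression'/>"
        + "<xs:enumeration value='classification'/>"
        + "</xs:restriction></xs:simpleType>"
        + "</xs:attribute>"
        + "</xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Validates the XML text against the XSD text and returns the messages of
    // all schema violations; an empty list means the document is valid.
    static List<String> validate(String xsd, String xml) {
        final List<String> problems = new ArrayList<>();
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            // Collect violations instead of stopping at the first one.
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                @Override public void error(SAXParseException e) {
                    problems.add(e.getMessage());
                }
                @Override public void fatalError(SAXParseException e) throws SAXException {
                    throw e; // malformed XML: nothing sensible left to validate
                }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalArgumentException("Schema or document could not be parsed", e);
        }
        return problems;
    }
}
```

Calling XsdCheckSketch.validate(XsdCheckSketch.TOY_XSD, "<Model/>") returns a
non-empty list because the required attribute is missing, which is the kind of
exporter omission a static check is meant to catch.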
The validation code by [~selvinsource] is similar to the BatchUtil utility
class. However, the latter is functionally more complete (e.g., it compares
floating-point values using customizable precision criteria) and requires
less coding when introducing new model types or model instances.
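The idea behind the dynamic check in 2), including a tolerance for
floating-point results, can be sketched in plain Java as follows. This is not
BatchUtil's implementation or API: the class BatchCompareSketch, its methods and
the precision and zeroThreshold parameters are hypothetical names chosen for
illustration, and the CSV handling assumes simple comma-separated text without
quoting. In a real test, the "actual" CSV would hold the scores produced by
evaluating the exported PMML file.

```java
import java.util.ArrayList;
import java.util.List;

class BatchCompareSketch {

    // Two cells match if they are identical strings, or if both parse as
    // numbers that are either both near zero or equal within a relative precision.
    static boolean cellsMatch(String expected, String actual,
                              double precision, double zeroThreshold) {
        if (expected.equals(actual)) {
            return true;
        }
        try {
            double e = Double.parseDouble(expected);
            double a = Double.parseDouble(actual);
            if (Math.abs(e) <= zeroThreshold && Math.abs(a) <= zeroThreshold) {
                return true;
            }
            return Math.abs(e - a) <= precision * Math.max(Math.abs(e), Math.abs(a));
        } catch (NumberFormatException ex) {
            return false; // non-numeric cells must match exactly
        }
    }

    // Compares two CSV texts (first row is the header) and returns one
    // message per mismatch; an empty list means the outputs agree.
    static List<String> compare(String expectedCsv, String actualCsv,
                                double precision, double zeroThreshold) {
        List<String> diffs = new ArrayList<>();
        String[] exp = expectedCsv.trim().split("\\R");
        String[] act = actualCsv.trim().split("\\R");
        if (exp.length != act.length) {
            diffs.add("row count: expected " + exp.length + " but got " + act.length);
            return diffs;
        }
        if (!exp[0].equals(act[0])) {
            diffs.add("header: expected '" + exp[0] + "' but got '" + act[0] + "'");
            return diffs;
        }
        String[] header = exp[0].split(",", -1);
        for (int i = 1; i < exp.length; i++) {
            String[] e = exp[i].split(",", -1);
            String[] a = act[i].split(",", -1);
            if (e.length != header.length || a.length != header.length) {
                diffs.add("row " + i + ": unexpected number of columns");
                continue;
            }
            for (int j = 0; j < header.length; j++) {
                if (!cellsMatch(e[j], a[j], precision, zeroThreshold)) {
                    diffs.add("row " + i + ", column " + header[j]
                        + ": expected " + e[j] + " but got " + a[j]);
                }
            }
        }
        return diffs;
    }
}
```

With a loose precision such as 1e-6, small rounding differences between two
producers pass; tightening the precision makes the same differences show up as
mismatches, which is why a configurable criterion is useful when comparing
against other PMML producers.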
> PMML correctness check
> ----------------------
>
> Key: SPARK-7540
> URL: https://issues.apache.org/jira/browse/SPARK-7540
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: Shuo Xiang
>
> Check correctness of PMML export for MLlib models by using PMML evaluator to
> load and run the models. This unfortunately needs to be done externally (not
> in spark-perf) because of licensing. A record of tests run and the results
> can be posted in this JIRA, as well as a link to the repo hosting the testing
> code.