[ 
https://issues.apache.org/jira/browse/SPARK-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541766#comment-14541766
 ] 

Villu Ruusmann commented on SPARK-7540:
---------------------------------------

There are two kinds of tests that should be considered:

1) Static checks against the PMML XSD file. This will ensure that Spark's PMML 
exporter initializes all the required elements and attributes, that 
attribute values are within valid ranges, etc. The 1.2.X versions of the 
JPMML-Model library JAR file include a suitable PMML XSD resource as /pmml.xsd, 
which is accessible via the method org.jpmml.model.JAXBContext#getSchema(). For 
a full example, see here: 
https://github.com/jpmml/jpmml-model/blob/master/pmml-model-example/src/main/java/org/jpmml/model/ValidationExample.java

2) Dynamic checks using real data (i.e. if the PMML model is executed with an 
input data record X, then the result equals an output data record Y). The 
JPMML-Evaluator library includes a utility class org.jpmml.evaluator.BatchUtil 
that you can use to set up simple yet powerful batch validation workflows 
(it scores a PMML file with the specified input CSV file, and ensures that the 
results equal the specified output CSV file). I'm using this approach to 
test against other open-source PMML producer software like R/Rattle, KNIME etc. 
It is likely that future versions of the JPMML-Evaluator library will include a 
separate integration tests module for Spark as well.
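
To illustrate the static check described in 1), here is a minimal sketch that 
uses only the JDK's javax.xml.validation API. The class name XsdCheckSketch 
and the toy schema are assumptions made for illustration; a real test would 
validate the exported document against the actual PMML XSD rather than this 
stand-in.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of JPMML or Spark.
final class XsdCheckSketch {

    // Toy schema standing in for the PMML XSD (assumption):
    // a single <Model> element with a required "name" attribute.
    static final String TOY_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='Model'>"
      + "<xs:complexType>"
      + "<xs:attribute name='name' type='xs:string' use='required'/>"
      + "</xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Compiles an XSD given as a string into a reusable Schema object.
    static Schema compile(String xsd) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("Schema itself is malformed", e);
        }
    }

    // Returns true if the XML document validates against the schema.
    static boolean conforms(Schema schema, String xml) {
        try {
            schema.newValidator()
                  .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The default error handler raises on any validation error.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Schema schema = compile(TOY_XSD);
        // Document with the required attribute present.
        System.out.println(conforms(schema, "<Model name='lr'/>"));
        // Document with the required attribute missing.
        System.out.println(conforms(schema, "<Model/>"));
    }
}
```

The same conforms() call would work unchanged with a Schema object obtained 
from the XSD resource bundled with JPMML-Model.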

The validation code by [~selvinsource] is similar to the BatchUtil utility 
class. However, the latter is functionally more complete (e.g. validating 
floating-point values using customizable precision criteria) and requires 
less coding when introducing new model types or model instances.
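
To make the idea of customizable precision criteria concrete, here is a 
minimal plain-Java sketch of comparing expected and actual output records 
with a relative tolerance. It does not use or reproduce the BatchUtil API; 
the class RecordCompareSketch, its method names, and the precision and 
zeroThreshold parameters are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JPMML-Evaluator.
final class RecordCompareSketch {

    // Parses a cell as a number, or returns null for non-numeric cells.
    static Double parse(String s) {
        try {
            return Double.valueOf(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Numeric cells match within a relative tolerance ("precision");
    // expected values near zero are compared against "zeroThreshold".
    // Non-numeric cells (e.g. class labels) must be equal strings.
    static boolean fieldMatches(String expected, String actual,
                                double precision, double zeroThreshold) {
        Double e = parse(expected);
        Double a = parse(actual);
        if (e == null || a == null) {
            return expected.equals(actual);
        }
        if (Math.abs(e) <= zeroThreshold) {
            return Math.abs(a) <= zeroThreshold;
        }
        return Math.abs(a - e) <= precision * Math.abs(e);
    }

    // Compares two tables row by row and describes every mismatch found.
    static List<String> differences(List<Map<String, String>> expected,
                                    List<Map<String, String>> actual,
                                    double precision, double zeroThreshold) {
        List<String> diffs = new ArrayList<>();
        if (expected.size() != actual.size()) {
            diffs.add("row count: expected " + expected.size()
                + ", actual " + actual.size());
            return diffs;
        }
        for (int i = 0; i < expected.size(); i++) {
            for (Map.Entry<String, String> cell : expected.get(i).entrySet()) {
                String got = actual.get(i).get(cell.getKey());
                if (got == null || !fieldMatches(cell.getValue(), got,
                                                 precision, zeroThreshold)) {
                    diffs.add("row " + i + ", field " + cell.getKey()
                        + ": expected " + cell.getValue() + ", actual " + got);
                }
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        List<Map<String, String>> expected =
            List.of(Map.of("label", "yes", "probability", "0.3333333"));
        List<Map<String, String>> actual =
            List.of(Map.of("label", "yes", "probability", "0.33333334"));
        // An empty list means the tables agree within the tolerance.
        System.out.println(differences(expected, actual, 1e-6, 1e-9));
    }
}
```

In a setup like this, the rows would be read from the expected-output CSV 
file and from the evaluator's results, and a test would pass only when the 
list of differences is empty.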

> PMML correctness check
> ----------------------
>
>                 Key: SPARK-7540
>                 URL: https://issues.apache.org/jira/browse/SPARK-7540
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Shuo Xiang
>
> Check correctness of PMML export for MLlib models by using PMML evaluator to 
> load and run the models.  This unfortunately needs to be done externally (not 
> in spark-perf) because of licensing.  A record of tests run and the results 
> can be posted in this JIRA, as well as a link to the repo hosting the testing 
> code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

