Github user selvinsource commented on the pull request:
https://github.com/apache/spark/pull/7842#issuecomment-133817289
Here are some initial results from my tests (trying to load the exported XML
into JPMML for evaluation).
**Node with no predicate**
If you look at
https://github.com/selvinsource/spark-pmml-exporter-validator/tree/decision_trees_validator
and follow the Decision Tree (Classification) instructions,
you get:
[code]
java -jar target/spark-pmml-exporter-validator-1.1.0-SNAPSHOT-jar-with-dependencies.jar DecisionTreeClassificationModel
DecisionTreeClassificationModel selected
Exception in thread "main" org.jpmml.manager.InvalidFeatureException (at or around line 30): Node
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateNode(TreeModelEvaluator.java:155)
    at org.jpmml.evaluator.TreeModelEvaluator.handleTrue(TreeModelEvaluator.java:186)
[/code]
The reason is that node 32 has no predicate (see the generated XML:
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/decision_trees_validator/src/main/resources/exported_pmml_models/decisiontree_classification.xml).
If you look at the documentation
http://www.dmg.org/v4-2-1/TreeModel.html#xsdGroup_PREDICATE, it says "Each
Node has one PREDICATE; that may be a SimplePredicate, a SetPredicate, a
CompoundPredicate, a True, or a False.".
In this case, node 32, I assume we need True:
<Node id="32" score="0.0"><True/></Node>
This should apply to all other nodes with no predicate.
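To make the suggestion concrete, here is a minimal sketch in Python (standard library only) that post-processes a PMML tree and inserts `<True/>` into every `Node` that has no predicate. It is only an illustration of the expected output shape, not the exporter fix itself, which would belong in the Spark export code. The function names and the sample XML fragment are hypothetical, and the sketch assumes the nodes carry no `Extension` children (which the schema places before the predicate).

```python
import xml.etree.ElementTree as ET

# Element names that count as a PMML predicate inside a Node.
PREDICATES = {"SimplePredicate", "SimpleSetPredicate",
              "CompoundPredicate", "True", "False"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def add_missing_true_predicates(root):
    """Insert <True/> as first child of every Node lacking a predicate.

    Returns the number of nodes that were patched.
    """
    patched = 0
    # Snapshot the elements first so the tree is not mutated while iterating.
    for node in list(root.iter()):
        if local_name(node.tag) != "Node":
            continue
        if any(local_name(child.tag) in PREDICATES for child in node):
            continue
        # Reuse the node's own namespace prefix (empty if there is none).
        ns_prefix = node.tag[: node.tag.rfind("}") + 1]
        node.insert(0, ET.Element(ns_prefix + "True"))
        patched += 1
    return patched


# Hypothetical, heavily trimmed fragment (not a complete PMML document).
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <TreeModel functionName="classification">
    <Node id="1">
      <True/>
      <Node id="32" score="0.0"/>
      <Node id="33" score="1.0">
        <SimplePredicate field="field_0" operator="greaterThan" value="2.5"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

root = ET.fromstring(SAMPLE)
print("nodes patched:", add_missing_true_predicates(root))
```

Running the same function a second time on the patched tree should find nothing left to fix, which is a cheap sanity check before handing the file to JPMML.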
Once fixed, please rerun the command above and verify that the evaluation is
correct:
java -jar target/spark-pmml-exporter-validator-1.1.0-SNAPSHOT-jar-with-dependencies.jar DecisionTreeClassificationModel
You should get class 0 for the first item and class 1 for the second, matching
the Scala code
(https://github.com/selvinsource/spark-pmml-exporter-validator/blob/decision_trees_validator/src/main/resources/spark_shell_exporter/decisiontree_breastcancerwisconsin.scala).
**Data Fields and Mining Fields**
I noticed that in order to evaluate the model in JPMML I need to provide
only a subset of the fields, and in this order: field_0, field_5, field_7,
field_1. This is inconsistent with the Scala code (Spark), where for
prediction I still provide all the features, in the same order and number as
in the training data.
It is not a big deal, but it would be nice for the behavior to be consistent
with Spark prediction. Also, whenever the model is retrained, the exported
decision tree can use different fields in a different order, so the
application using JPMML would have to be changed each time to match.
It would be easier if the exported model listed all the fields in their
natural order (field_0, field_1, field_2, ...) so that the input can still be
the full list of fields (some will be ignored by the decision tree, but that
is fine). The advantage is that no code change is needed over time in the
decoupled application using JPMML, and it can use the same input vector as
the one used for training the model.
@mengxr What is your opinion on this point?
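To illustrate the point from the consumer side, here is a small Python sketch of what the decoupled application has to do in each case. It does not use any JPMML API; the helper names and the feature values are made up for illustration, and only the `field_i` naming and the field order field_0, field_5, field_7, field_1 come from the export discussed above.

```python
def to_named_fields(features, prefix="field_"):
    """Map a full feature vector (training order) to field_0, field_1, ..."""
    return {f"{prefix}{i}": value for i, value in enumerate(features)}


def select_active_fields(named, active_fields):
    """Keep only the fields the exported tree declares, in its own order."""
    missing = [name for name in active_fields if name not in named]
    if missing:
        raise KeyError(f"input vector has no value for: {missing}")
    return {name: named[name] for name in active_fields}


# Made-up input vector with nine features, in training order.
features = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]

# If the export declared every field, this mapping would be all the
# application needs, and it would survive retraining unchanged.
all_fields = to_named_fields(features)

# With the current export, the application must also track which fields the
# tree happens to use, and this list changes whenever the tree changes.
active = ["field_0", "field_5", "field_7", "field_1"]
print(select_active_fields(all_fields, active))
```

The second step is the part that has to be kept in sync with each exported model today; declaring all fields in the export would let the application stop at the first step.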
**Decision Tree Regression**
Once the above points have been addressed, I will run some checks on the
regression model; in
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/decision_trees_validator/src/main/java/org/selvinsource/spark_pmml_exporter_validator/SparkPMMLExporterValidator.java
it is currently marked as TODO.