[GitHub] spark pull request: [SPARK-8542][MLlib]PMML export for Decision Tr...

selvinsource Sun, 23 Aug 2015 04:02:21 -0700

Github user selvinsource commented on the pull request:

    https://github.com/apache/spark/pull/7842#issuecomment-133817289
  
    Here are some initial results from my tests (loading the exported XML into
    JPMML for evaluation).
    
    **Node with no predicate**
    If you follow the Decision Tree (Classification) instructions at
    https://github.com/selvinsource/spark-pmml-exporter-validator/tree/decision_trees_validator
    you get:
    [code]
    java -jar target/spark-pmml-exporter-validator-1.1.0-SNAPSHOT-jar-with-dependencies.jar DecisionTreeClassificationModel
    DecisionTreeClassificationModel selected
    Exception in thread "main" org.jpmml.manager.InvalidFeatureException (at or around line 30): Node
        at org.jpmml.evaluator.TreeModelEvaluator.evaluateNode(TreeModelEvaluator.java:155)
        at org.jpmml.evaluator.TreeModelEvaluator.handleTrue(TreeModelEvaluator.java:186)
    [/code]
    
    The reason is that node 32 has no predicate (see the generated XML:
    https://github.com/selvinsource/spark-pmml-exporter-validator/blob/decision_trees_validator/src/main/resources/exported_pmml_models/decisiontree_classification.xml).
 
    
    The documentation at
    http://www.dmg.org/v4-2-1/TreeModel.html#xsdGroup_PREDICATE says: "Each
    Node has one PREDICATE; that may be a SimplePredicate, a SetPredicate, a
    CompoundPredicate, a True, or a False."
    
    In this case (node 32), I assume we need True:
        <Node id="32" score="0.0"><True/></Node>
    
    This should apply to all other nodes with no predicate.
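
    To illustrate the fix, here is a minimal Python sketch that adds <True/> to
    every Node lacking a predicate in an already exported file. It is not the
    exporter code from this pull request: the function names and the sample
    tree are hypothetical, and it assumes the nodes carry no <Extension>
    children (which the schema places before the predicate).

```python
import xml.etree.ElementTree as ET

# Predicate element names accepted inside a PMML <Node>.
PREDICATES = {"SimplePredicate", "SimpleSetPredicate", "CompoundPredicate",
              "True", "False"}


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def add_missing_true(pmml_text):
    """Return (patched XML, ids of the nodes that had no predicate)."""
    root = ET.fromstring(pmml_text)
    patched = []
    # Snapshot the elements first so inserting children is safe.
    for node in list(root.iter()):
        if local_name(node.tag) != "Node":
            continue
        if any(local_name(child.tag) in PREDICATES for child in node):
            continue
        # Reuse the node's namespace (if any) for the new <True/> element.
        prefix = node.tag[: node.tag.index("}") + 1] if node.tag.startswith("{") else ""
        # Assumption: no <Extension> children, so the predicate goes first.
        node.insert(0, ET.Element(prefix + "True"))
        patched.append(node.get("id"))
    return ET.tostring(root, encoding="unicode"), patched


# Hypothetical miniature tree; node 32 mirrors the failing leaf.
sample = (
    '<TreeModel>'
    '<Node id="1"><True/>'
    '<Node id="2" score="1.0">'
    '<SimplePredicate field="field_0" operator="lessOrEqual" value="2.5"/>'
    '</Node>'
    '<Node id="32" score="0.0"/>'
    '</Node>'
    '</TreeModel>'
)
fixed, ids = add_missing_true(sample)
print(ids)
```

    Patching the file afterwards is only a way to confirm the diagnosis; the
    real fix belongs in the exporter so that every node is written with a
    predicate.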
    
    Once fixed, please rerun the command above and verify that the evaluation is
    correct:
        java -jar target/spark-pmml-exporter-validator-1.1.0-SNAPSHOT-jar-with-dependencies.jar DecisionTreeClassificationModel
    You should get Class 0 for the first item and Class 1 for the second,
    matching the Scala code
    (https://github.com/selvinsource/spark-pmml-exporter-validator/blob/decision_trees_validator/src/main/resources/spark_shell_exporter/decisiontree_breastcancerwisconsin.scala).
    
    **Data Fields and Mining Fields**
    To evaluate the model in JPMML I have to provide only a subset of the
    fields, in this order: field_0, field_5, field_7, field_1. This is
    inconsistent with the Scala code (Spark), where for prediction I still
    provide all the features, in the same order and number as in the training
    data.
    It is not a big deal, but it would be nice if this behaved consistently
    with Spark prediction. Also, over time the exported decision tree can
    change which fields it uses and in what order, so the application using
    JPMML would have to be changed every time to match. It would be easier if
    the exported model listed all the fields in their natural order (field_0,
    field_1, field_2, ...), so that we can still provide the full list of
    fields as input (some will be ignored by the decision tree, but that is
    fine). The advantage is that the decoupled application using JPMML needs no
    code change over time and takes the same input vector we use for training
    the model.
    @mengxr What is your opinion on this point?
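
    Until that is decided, a consumer can work around the issue by reading the
    field names from the exported file and picking the matching values out of
    the full feature vector. The Python sketch below shows the idea; the
    function names and the sample MiningSchema are hypothetical, and it assumes
    the field_<index> naming seen above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def active_fields(pmml_text):
    """Names of the active MiningField entries, in document order."""
    root = ET.fromstring(pmml_text)
    return [
        el.get("name")
        for el in root.iter()
        if local_name(el.tag) == "MiningField"
        # usageType defaults to "active" when the attribute is absent.
        and el.get("usageType", "active") == "active"
    ]


def arguments_for(features, fields):
    """Map each requested field name to its value in the full vector."""
    # Assumption: names follow the pattern field_<index into the vector>.
    return {name: features[int(name.split("_", 1)[1])] for name in fields}


# Hypothetical schema using the field subset and order reported above.
schema = (
    '<MiningSchema>'
    '<MiningField name="field_0" usageType="active"/>'
    '<MiningField name="field_5" usageType="active"/>'
    '<MiningField name="field_7" usageType="active"/>'
    '<MiningField name="field_1" usageType="active"/>'
    '<MiningField name="target" usageType="target"/>'
    '</MiningSchema>'
)
# Full nine-feature input vector, in training order (made-up values).
features = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]

fields = active_fields(schema)
args = arguments_for(features, fields)
print(fields)
print(args)
```

    This keeps the calling code stable when the tree changes, but every
    consumer has to repeat it, which is why exporting all the fields in order
    would still be preferable.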
    
    **Decision Tree Regression**
    Once the above points have been addressed I will run some checks on the
    regression model. In
    https://github.com/selvinsource/spark-pmml-exporter-validator/blob/decision_trees_validator/src/main/java/org/selvinsource/spark_pmml_exporter_validator/SparkPMMLExporterValidator.java
    it is still marked as TODO.


