[jira] [Updated] (SPARK-45154) Pyspark DecisionTreeClassifier: results and tree structure in spark3 very different from that of the spark2 version on the same data and with the same hyperparameters.

Oumar Nour (Jira) Mon, 25 Sep 2023 09:15:00 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-45154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Oumar Nour updated SPARK-45154:
-------------------------------
    Description: 
Hello,
I have an engine running on Spark 2 that trains a DecisionTreeClassifier model 
with CrossValidator. 

 
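{code:python}
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

dt = DecisionTreeClassifier(maxBins=10000, seed=0)
cv_dt_evaluator = BinaryClassificationEvaluator(
    metricName="",  # metric name left blank in the original report
    rawPredictionCol="probability",
)

# Create param grid and cross validator for model selection
dt_grid = (
    ParamGridBuilder()
    .addGrid(dt.minInstancesPerNode, [100])
    .addGrid(dt.maxDepth, [10])
    .build()
)
cv = CrossValidator(
    estimator=dt,
    estimatorParamMaps=dt_grid,
    evaluator=cv_dt_evaluator,
    parallelism=4,  # comma was missing here in the original snippet
    numFolds=4,
){code}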
 

I want to {*}migrate from Spark 2 to Spark 3{*}. I ran 
*DecisionTreeClassifier* on the same data with the same parameter values, but 
the results are {*}completely different, especially in terms of 
tree structure{*}: on Spark 3 the trees are shallower and have fewer splits. 
I have read the documentation but have not found an explanation.
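
One way to make the difference in tree structure concrete is to diff the two trees directly. The following is a minimal sketch, not a definitive tool. It assumes the text of `model.toDebugString` from each Spark version has been saved, and it assumes split lines in that text begin with `If (`. The helper names `count_splits` and `diff_trees` are hypothetical and not part of PySpark, and the two sample dumps are illustrative rather than real output.

```python
import difflib


def count_splits(debug_string):
    """Count internal nodes, assuming each split is printed as an 'If (' line."""
    return sum(
        1 for line in debug_string.splitlines() if line.strip().startswith("If (")
    )


def diff_trees(spark2_tree, spark3_tree):
    """Return a unified diff of two tree dumps, skipping the header line.

    The first line carries the model uid, which differs between runs,
    so only the node lines are compared.
    """
    old = spark2_tree.splitlines()[1:]
    new = spark3_tree.splitlines()[1:]
    return list(difflib.unified_diff(old, new, "spark2", "spark3", lineterm=""))


# Illustrative dumps, not real output from the reporter's data.
spark2_tree = """header
  If (feature 0 <= 0.5)
   If (feature 1 <= 2.0)
    Predict: 0.0
   Else (feature 1 > 2.0)
    Predict: 1.0
  Else (feature 0 > 0.5)
   Predict: 1.0"""
spark3_tree = """header
  If (feature 0 <= 0.5)
   Predict: 0.0
  Else (feature 0 > 0.5)
   Predict: 1.0"""

print(count_splits(spark2_tree), count_splits(spark3_tree))
for line in diff_trees(spark2_tree, spark3_tree):
    print(line)
```

Comparing the first diverging split in such a diff shows whether the trees differ from the root down or only in the deeper levels.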

 

Can you help me find a solution to this problem?

Thanks in advance for your help 

        

 

  was:
Hello,
I have an engine running on Spark 2 that trains a DecisionTreeClassifier model 
with CrossValidator. 

 
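{code:python}
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

dt = DecisionTreeClassifier(maxBins=10000, seed=0)
cv_dt_evaluator = BinaryClassificationEvaluator(
    metricName="",  # metric name left blank in the original report
    rawPredictionCol="probability",
)

# Create param grid and cross validator for model selection
dt_grid = (
    ParamGridBuilder()
    .addGrid(dt.minInstancesPerNode, [100])
    .addGrid(dt.maxDepth, [10])
    .build()
)
cv = CrossValidator(
    estimator=dt,
    estimatorParamMaps=dt_grid,
    evaluator=cv_dt_evaluator,
    parallelism=4,  # comma was missing here in the original snippet
    numFolds=4,
){code}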
 

I want to {*}migrate from Spark 2 to Spark 3{*}. I ran 
*DecisionTreeClassifier* on the same data with the same parameter values, but 
the results are {*}completely different, especially in terms of 
tree structure{*}: on Spark 3 the trees are shallower and have fewer splits. 
I have read the documentation but have not found an explanation.

I read somewhere that the behavior of the *minInstancesPerNode* parameter 
changed in Spark 3: that it no longer applies to the total number of instances 
in a node, but instead sets the minimum number of instances per data partition 
in the node required to create a child node. If so, this could affect how the 
decision tree is built, particularly when the data is unevenly partitioned. 
*IS THIS TRUE?*
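
For reference, the Spark ML documentation describes `minInstancesPerNode` as the minimum number of instances each child must have after a split; a split that leaves either child with fewer is discarded as invalid. The sketch below illustrates that documented rule in plain Python on node-level counts. It is not Spark's implementation, and `is_valid_split` is a hypothetical helper introduced only for illustration.

```python
def is_valid_split(left_count, right_count, min_instances_per_node):
    """Documented rule: both children must keep at least the minimum count."""
    return (
        left_count >= min_instances_per_node
        and right_count >= min_instances_per_node
    )


# With minInstancesPerNode=100, as in the param grid above:
print(is_valid_split(150, 120, 100))  # both children are large enough
print(is_valid_split(950, 50, 100))   # right child is too small, split rejected
```

Under this reading, the threshold is checked against the counts of the candidate child nodes, so it bounds how far a branch can keep splitting as nodes shrink.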

Can you help me find a solution to this problem?

Thanks in advance for your help 

        

 


> Pyspark DecisionTreeClassifier: results and tree structure in spark3 very 
> different from that of the spark2 version on the same data and with the same 
> hyperparameters.
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-45154
>                 URL: https://issues.apache.org/jira/browse/SPARK-45154
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark, Spark Core
>    Affects Versions: 3.0.0, 3.3.1, 3.2.4, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>            Reporter: Oumar Nour
>            Priority: Critical
>              Labels: decisiontree, pyspark3, spark2, spark3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
