[ 
https://issues.apache.org/jira/browse/SPARK-45154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oumar Nour updated SPARK-45154:
-------------------------------
    Labels: decisiontree pyspark3 spark2 spark3  (was: )

> Pyspark DecisionTreeClassifier: results and tree structure in spark3 very 
> different from that of the spark2 version on the same data and with the same 
> hyperparameters.
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-45154
>                 URL: https://issues.apache.org/jira/browse/SPARK-45154
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark
>    Affects Versions: 3.0.0, 3.3.1, 3.2.4, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>            Reporter: Oumar Nour
>            Priority: Major
>              Labels: decisiontree, pyspark3, spark2, spark3
>
> Hello,
> I have an engine running on Spark 2 that trains a DecisionTreeClassifier 
> model with CrossValidator. 
>  
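> {code:python}
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
>
> dt = DecisionTreeClassifier(maxBins=10000, seed=0)
> cv_dt_evaluator = BinaryClassificationEvaluator(
>     metricName="",  # value left empty in the original report
>     rawPredictionCol="probability")
> # Create param grid and cross validator for model selection.
> # addGrid expects a list of candidate values for each parameter.
> dt_grid = ParamGridBuilder()\
>     .addGrid(dt.minInstancesPerNode, [100])\
>     .addGrid(dt.maxDepth, [10])\
>     .build()
> cv = CrossValidator(
>     estimator=dt, estimatorParamMaps=dt_grid,
>     evaluator=cv_dt_evaluator,
>     parallelism=4,
>     numFolds=4
> ){code}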
>  
> I want to {*}migrate from Spark 2 to Spark 3{*}. I ran 
> *DecisionTreeClassifier* on the same data with the same parameter values, but 
> the results are {*}completely different, especially in terms of tree 
> structure{*}: the trees built on Spark 3 are shallower and have fewer splits. 
> I have read the documentation but could not find an explanation.
> I read somewhere that the behavior of the *minInstancesPerNode* parameter 
> changed in Spark 3: it supposedly no longer applies to the total number of 
> instances in a node, but to the minimum number of instances per data 
> partition in the node required to create a child node. Such a change could 
> affect how the decision tree is built, particularly when the data is unevenly 
> partitioned. *IS THIS TRUE?*
> Can you help me find a solution to this problem?
> Thanks in advance for your help.
>         
>  
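
For reference, the parameter documentation for minInstancesPerNode describes it as the minimum number of instances each child must have after a split, with a split discarded as invalid if either child would fall below that number. The plain-Python sketch below only illustrates that documented rule on a single numeric feature. It is not Spark's implementation, the helper names split_is_valid and valid_thresholds are hypothetical, and it does not by itself settle whether the behavior differs between Spark 2 and Spark 3.

```python
# Illustrative sketch only: this is NOT Spark's implementation.
# Assumption: the constraint is applied to the total row count of each child,
# as the parameter's documented description states.

def split_is_valid(left_count, right_count, min_instances_per_node):
    """Return True if both children keep at least min_instances_per_node rows."""
    return (left_count >= min_instances_per_node
            and right_count >= min_instances_per_node)


def valid_thresholds(values, min_instances_per_node):
    """List candidate thresholds on one numeric feature that satisfy the constraint.

    A threshold t sends rows with value <= t to the left child and all
    remaining rows to the right child.
    """
    thresholds = []
    # The largest distinct value is skipped: it would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = sum(1 for v in values if v <= t)
        right = len(values) - left
        if split_is_valid(left, right, min_instances_per_node):
            thresholds.append(t)
    return thresholds


feature = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(valid_thresholds(feature, 1))  # every interior threshold is allowed
print(valid_thresholds(feature, 4))  # only thresholds near the middle remain
print(valid_thresholds(feature, 6))  # no split can give both children 6 rows
```

Under this rule, a larger minInstancesPerNode removes candidate splits and therefore tends to produce shallower trees, so any difference in how the count is computed would show up directly in the tree structure.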


