[ https://issues.apache.org/jira/browse/SPARK-45154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Oumar Nour updated SPARK-45154:
-------------------------------
Priority: Critical (was: Major)
> Pyspark DecisionTreeClassifier: results and tree structure in spark3 very
> different from that of the spark2 version on the same data and with the same
> hyperparameters.
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-45154
> URL: https://issues.apache.org/jira/browse/SPARK-45154
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib, PySpark
> Affects Versions: 3.0.0, 3.3.1, 3.2.4, 3.3.3, 3.3.2, 3.4.0, 3.4.1
> Reporter: Oumar Nour
> Priority: Critical
> Labels: decisiontree, pyspark3, spark2, spark3
>
> Hello,
> I have an engine running on Spark 2 that trains a DecisionTreeClassifier
> model through a CrossValidator.
>
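> {code:python}
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
>
> dt = DecisionTreeClassifier(maxBins=10000, seed=0)
> # metricName is blank in the report as posted; Spark accepts
> # "areaUnderROC" (the default) or "areaUnderPR" here.
> cv_dt_evaluator = BinaryClassificationEvaluator(
>     metricName="",
>     rawPredictionCol="probability")
> # Create param grid and cross validator for model selection.
> # addGrid expects a list of candidate values for each param.
> dt_grid = (
>     ParamGridBuilder()
>     .addGrid(dt.minInstancesPerNode, [100])
>     .addGrid(dt.maxDepth, [10])
>     .build()
> )
> cv = CrossValidator(
>     estimator=dt,
>     estimatorParamMaps=dt_grid,
>     evaluator=cv_dt_evaluator,
>     parallelism=4,
>     numFolds=4,
> ){code}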
>
> I want to {*}migrate from Spark 2 to Spark 3{*}. I ran
> *DecisionTreeClassifier* on the same data with the same parameter values, but
> the results are {*}completely different, especially in terms of
> tree structure{*}: the Spark 3 trees are shallower and have fewer splits.
> I have read the documentation but have not found an explanation.
> I read somewhere that the behavior of the *minInstancesPerNode* parameter
> changed in Spark 3: it supposedly no longer applies to the total number of
> instances in a node, but to the number of instances per data partition in
> the node that are required to create a child node. Such a change could affect
> how the decision tree is built, particularly when the data is unevenly
> partitioned. *IS THIS TRUE?*
> Can you help me find a solution to this problem?
> Thanks in advance for your help.
>
>
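To make the "less depth and fewer splits" observation concrete, the two trees could be reduced to a few numbers and compared side by side. The following is only a minimal sketch: it assumes the text returned by the model's `toDebugString` consists of `If (...)`, `Else (...)` and `Predict:` lines with one extra space of indentation per tree level, and it ignores the header line, whose wording differs between Spark versions. The helper name `summarize_tree` and the `SAMPLE` string are hypothetical and not taken from the report.

```python
def summarize_tree(debug_string):
    """Return (num_splits, num_leaves, depth) parsed from a tree debug string.

    Assumption: internal nodes are printed as "If (...)" / "Else (...)"
    pairs, leaves as "Predict: ...", and each level is indented by one
    additional space. Any other line (such as the header) is skipped.
    """
    splits = 0
    leaves = 0
    leaf_indents = []
    root_indent = None
    for line in debug_string.splitlines():
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        if stripped.startswith("If ("):
            splits += 1  # one "If" line per internal node
        elif stripped.startswith("Predict:"):
            leaves += 1
            leaf_indents.append(indent)
        else:
            continue  # header and "Else" lines carry no extra information
        if root_indent is None:
            root_indent = indent  # the first node line is the root
    depth = max(leaf_indents) - root_indent if leaf_indents else 0
    return splits, leaves, depth


# Hypothetical debug string, written by hand for illustration only.
SAMPLE = "\n".join([
    "DecisionTreeClassificationModel: uid=dtc_example, depth=2, numNodes=5",
    "  If (feature 0 <= 0.5)",
    "   If (feature 1 <= 2.0)",
    "    Predict: 0.0",
    "   Else (feature 1 > 2.0)",
    "    Predict: 1.0",
    "  Else (feature 0 > 0.5)",
    "   Predict: 1.0",
])

print(summarize_tree(SAMPLE))
```

In each Spark version, the string passed in would presumably come from the fitted cross-validator's best model (for example `cv.fit(df).bestModel.toDebugString`, where `df` is the training DataFrame). Attaching both summaries, or the full debug strings, to the issue would show where the two trees start to diverge.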
--
This message was sent by Atlassian Jira
(v8.20.10#820010)