Oumar Nour created SPARK-45154:
----------------------------------
Summary: Pyspark DecisionTreeClassifier: results and tree
structure in spark3 very different from that of the spark2 version on the same
data and with the same hyperparameters.
Key: SPARK-45154
URL: https://issues.apache.org/jira/browse/SPARK-45154
Project: Spark
Issue Type: Bug
Components: ML, MLlib, PySpark
Affects Versions: 3.4.1, 3.4.0, 3.3.2, 3.3.3, 3.2.4, 3.3.1, 3.0.0
Reporter: Oumar Nour
Hello,
I have an engine running on spark2 that trains a DecisionTreeClassifier model
through a CrossValidator:
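{code:python}
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

dt = DecisionTreeClassifier(maxBins=10000, seed=0)
cv_dt_evaluator = BinaryClassificationEvaluator(
    # metric name is blank in the report; Spark accepts "areaUnderROC" or "areaUnderPR"
    metricName="",
    rawPredictionCol="probability")
# Create param grid and cross validator for model selection
# (addGrid expects a list of candidate values for each parameter)
dt_grid = ParamGridBuilder()\
    .addGrid(dt.minInstancesPerNode, [100])\
    .addGrid(dt.maxDepth, [10])\
    .build()
cv = CrossValidator(
    estimator=dt, estimatorParamMaps=dt_grid, evaluator=cv_dt_evaluator,
    parallelism=4,
    numFolds=4
)
{code}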
I want to {*}migrate from spark2 to spark3{*}. I ran
*DecisionTreeClassifier* on the same data with the same parameter values, but
my results are {*}completely different, especially in terms of
tree structure{*}: the trees are shallower and have fewer splits on spark3.
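To make the structural difference concrete, the text dumps of the two fitted trees can be compared side by side. The snippet below is only a minimal sketch in plain Python: it assumes each dump was obtained from the fitted model's `toDebugString` and that every split is printed as a line starting with `If (`. The helper names (`count_splits`, `diff_trees`) and the two sample dumps are hypothetical and are not the actual output from my data.

```python
import difflib

# Hypothetical sample dumps standing in for model.toDebugString output
# from the spark2 and spark3 runs (not real results).
SPARK2_DUMP = "\n".join([
    "DecisionTreeClassificationModel: depth=2, numNodes=5",
    "  If (feature 0 <= 0.5)",
    "   If (feature 1 <= 2.0)",
    "    Predict: 0.0",
    "   Else (feature 1 > 2.0)",
    "    Predict: 1.0",
    "  Else (feature 0 > 0.5)",
    "   Predict: 1.0",
])
SPARK3_DUMP = "\n".join([
    "DecisionTreeClassificationModel: depth=1, numNodes=3",
    "  If (feature 0 <= 0.5)",
    "   Predict: 0.0",
    "  Else (feature 0 > 0.5)",
    "   Predict: 1.0",
])


def count_splits(debug_string):
    """Count internal nodes, assuming one 'If (' line per split."""
    return sum(
        1 for line in debug_string.splitlines()
        if line.strip().startswith("If (")
    )


def diff_trees(old_dump, new_dump):
    """Return a unified diff of the two tree dumps as a list of lines."""
    return list(difflib.unified_diff(
        old_dump.splitlines(), new_dump.splitlines(),
        fromfile="spark2", tofile="spark3", lineterm="",
    ))


if __name__ == "__main__":
    print("splits on spark2:", count_splits(SPARK2_DUMP))
    print("splits on spark3:", count_splits(SPARK3_DUMP))
    print("\n".join(diff_trees(SPARK2_DUMP, SPARK3_DUMP)))
```

Running the same comparison on the real dumps from both versions would show at which node the two trees first diverge.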
I've read through the documentation but I haven't found an answer to my
question.
I read somewhere that the behavior of the *minInstancesPerNode* parameter has
changed and that in Spark 3, *minInstancesPerNode* no longer applies to the
total number of instances in a node but rather to the number of instances per
partition. This change may have an impact on the way the decision tree is
built, particularly when working with unevenly partitioned data. *IS THIS TRUE?*
Can you help me find a solution to this problem?
Thanks in advance for your help