[
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julian King updated SPARK-34591:
--------------------------------
Attachment: (was: Reproducible example of Spark bug.pdf)
> Pyspark undertakes pruning of decision trees and random forests outside the
> control of the user, leading to undesirable and unexpected outcomes that are
> challenging to diagnose and impossible to correct
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.4.0, 2.4.4, 3.1.1
> Reporter: Julian King
> Priority: Major
> Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden
> of predictions from decision trees and random forests by pruning the tree
> after fitting: branches whose child leaves all produce the same
> classification prediction are merged into a single leaf.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of
> the code, which defaults to true. However, this parameter is not exposed in
> the PySpark API, resulting in the pruning described above:
> * Always occurring (even though the user may not want it to)
> * Not being documented in the ML documentation, leading to decision tree
> behaviour that may conflict with what the user expects to happen
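> The pruning can be observed (but not controlled) from PySpark by inspecting
> the fitted model. The following is a minimal sketch, assuming a training
> DataFrame "train" with "features" and "label" columns:
> {code:python}
> from pyspark.ml.classification import DecisionTreeClassifier
>
> # Unlike the Scala implementation, no "prune" parameter can be passed here.
> dt = DecisionTreeClassifier(maxDepth=3)
> model = dt.fit(train)
>
> # With pruning always active, the printed tree can have far fewer nodes
> # than the depth and impurity settings alone would suggest.
> print(model.numNodes)
> print(model.toDebugString)
> {code}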
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from
> the tree (not the probability prediction from the node), this introduces
> additional bias compared to the situation where the pruning is not done. The
> impact here may be severe in some cases.
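> As a toy illustration (the leaf counts below are invented for exposition),
> consider two sibling leaves that both predict class 0 but carry very
> different probabilities; pruning merges them and averages that information
> away:
> {code:python}
> # Hypothetical leaf counts, purely for illustration.
> left  = {"n": 100, "pos": 10}   # P(class 1) = 0.10 -> predicts class 0
> right = {"n": 100, "pos": 30}   # P(class 1) = 0.30 -> predicts class 0
>
> # Both sides share the class prediction, so the split is pruned; every row
> # that previously scored 0.10 or 0.30 now scores the pooled estimate.
> merged_p = (left["pos"] + right["pos"]) / (left["n"] + right["n"])
> print(merged_p)  # 0.2
> {code}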
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get
> a decision tree classifier (or random forest classifier with a single tree)
> to split a single node, despite this being eminently supported by the data.
> This renders the decision trees and random forests completely unusable.
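> The sketch below illustrates the kind of data involved (it is not the
> attached example, and the exact counts are invented): the label is
> unbalanced, a split on x clearly reduces impurity, yet both candidate leaves
> predict the majority class, so the pruned tree collapses to a single root
> node:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.ml.feature import VectorAssembler
> from pyspark.ml.classification import DecisionTreeClassifier
>
> spark = SparkSession.builder.getOrCreate()
>
> # x = 0 group: 5/85 positive; x = 1 group: 25/85 positive. Splitting on x
> # reduces impurity, yet both children would still predict class 0.
> rows = ([(0.0, 0.0)] * 80 + [(0.0, 1.0)] * 5 +
>         [(1.0, 0.0)] * 60 + [(1.0, 1.0)] * 25)
> df = spark.createDataFrame(rows, ["x", "label"])
> df = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)
>
> model = DecisionTreeClassifier(maxDepth=2).fit(df)
> # Expected: at least one split on x. Observed with the pruning described
> # here: a single root node (numNodes == 1), i.e. the tree never splits.
> print(model.numNodes)
> {code}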
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in
> the built trees. However, here we have found that small changes in the
> hyper-parameters lead to large and unpredictable changes in the resultant
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same
> model, with the same hyper-parameter settings, on slightly different data may
> lead to large variations in the tree structure simply as a result of the
> pruning.
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be
> supported, but the pruning method described here makes this very difficult.
> +Problem 5: This pruning method is a substantial variation from the
> description of the decision tree algorithm in the MLlib documentation, and
> it is not documented anywhere+
> This made it extremely confusing for us to work out why we were seeing
> certain behaviours - we had to trace back through all of the detailed Spark
> release notes to identify where the problem might have been introduced.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
> * Set the default pruning behaviour to False rather than True, thereby
> bringing the default behaviour back into alignment with the documentation
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
> * Set the default pruning behaviour to False (as in Option 1)
> * Expand the PySpark API to expose the pruning behaviour as a
> user-controllable option
> * Document the change to the API
> * Document the change to the tree building behaviour at appropriate points
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because automatic
> pruning is not part of the generally understood approach to building
> decision trees, in which pruning is a separate and user-controllable step.
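> Under Option 2, the PySpark API could look like the sketch below; the
> parameter name "prune" mirrors the Scala side and is an assumption, not an
> existing PySpark option:
> {code:python}
> from pyspark.ml.classification import DecisionTreeClassifier
>
> # Hypothetical: "prune" does not exist in the current PySpark API.
> dt = DecisionTreeClassifier(maxDepth=5, prune=False)
>
> # With prune=False, the fitted tree would keep sibling leaves that share
> # a class prediction but differ in predicted probability.
> model = dt.fit(train)  # "train" assumed as in the earlier sketch
> {code}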
>