GitHub user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/16722#discussion_r99670310
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
---
@@ -106,14 +122,18 @@ class DecisionTreeClassifier @Since("1.4.0") (
".train() called with non-matching numClasses and thresholds.length." +
s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
}
-
- val oldDataset: RDD[LabeledPoint] = extractLabeledPoints(dataset, numClasses)
--- End diff --
For regressors, `extractLabeledPoints` doesn't do any extra checking. The
larger issue is that we are manually "extracting instances" here, while the
convenience methods we have are for labeled points. Correcting that in this PR
likely means implementing the framework to correct it everywhere, which is a
larger and orthogonal change. I think we could just add the check manually to
the classifier, then create a JIRA that addresses consolidating these, probably
by adding `extractInstances` methods analogous to their labeled point
counterparts. IMO this PR is large enough as is, without having to think about
adding that method and then implementing it in all the other algos that
manually extract instances.
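
To make the suggestion concrete, here is a rough sketch of what a manual label
check inside an `extractInstances`-style helper might look like. It is plain
Scala with no Spark dependency: the `Instance` case class, `validateLabel`, and
the tuple-based `rows` input are all hypothetical stand-ins for illustration,
not the actual Spark API or the code in this PR.

```scala
// Hypothetical stand-in for an instance with a label, a weight, and features.
final case class Instance(label: Double, weight: Double, features: Vector[Double])

object ExtractInstancesSketch {
  // Assumed check: a classification label must be an integer in [0, numClasses).
  // `require` throws IllegalArgumentException when the condition is false.
  def validateLabel(label: Double, numClasses: Int): Unit = {
    require(label % 1 == 0 && label >= 0 && label < numClasses,
      s"Classifier was given a dataset with invalid label $label. " +
        s"Labels must be integers in range [0, $numClasses).")
  }

  // `rows` holds (label, weight, features) tuples standing in for dataset rows.
  def extractInstances(
      rows: Seq[(Double, Double, Vector[Double])],
      numClasses: Int): Seq[Instance] = {
    rows.map { case (label, weight, features) =>
      validateLabel(label, numClasses)
      Instance(label, weight, features)
    }
  }
}
```

A regressor variant would presumably skip `validateLabel`, in line with
`extractLabeledPoints` doing no extra checking for regressors.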