Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19516#discussion_r146531755
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
     val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
       origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
     } else {
-      Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
+      null
--- End diff --
Yes, I admit it is hard to get the `values` and/or `numValues` here.
The current Spark code throws an exception when we pipeline ChiSqSelector +
DecisionTreeClassifier (on features without attributes).
But if we remove the code that adds `Nominal` here (in ChiSqSelector),
the ChiSqSelector + DecisionTreeClassifier pipeline (on features without
attributes) will run successfully yet produce a wrong result, because
DecisionTreeClassifier treats them as continuous features.
Between explicitly throwing an exception and appearing to run successfully
while being wrong internally, I tend to keep the current choice.
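
The trade-off can be sketched without Spark. This is a simplified model under stated assumptions, not Spark's implementation: `Attr`, `Continuous`, `Nominal`, and `categoricalArity` are hypothetical stand-ins for the attribute metadata and for the way a tree learner might read it. A nominal attribute with no known arity fails loudly, while missing metadata silently falls back to continuous features.

```scala
// Simplified stand-ins for ML attribute metadata; these are NOT the real
// org.apache.spark.ml.attribute classes.
sealed trait Attr
case object Continuous extends Attr
final case class Nominal(numValues: Option[Int]) extends Attr

object AttributeTradeoff {
  // Hypothetical model of how a tree learner could interpret feature metadata.
  // Returns featureIndex -> number of categories for each categorical feature.
  def categoricalArity(attrs: Option[Seq[Attr]]): Map[Int, Int] = attrs match {
    // No metadata at all: every feature is treated as continuous.
    case None => Map.empty
    case Some(as) =>
      as.zipWithIndex.collect {
        // Nominal with a known arity: usable as a categorical feature.
        case (Nominal(Some(n)), i) => i -> n
        // Nominal without an arity: fail explicitly instead of guessing.
        case (Nominal(None), i) =>
          throw new IllegalArgumentException(
            s"feature $i is nominal but has no numValues")
      }.toMap
  }
}
```

In this model, `AttributeTradeoff.categoricalArity(Some(Seq.fill(3)(Nominal(None))))` corresponds to the current default-`Nominal` behaviour and throws, whereas `AttributeTradeoff.categoricalArity(None)` corresponds to dropping the attributes and returns an empty map, so categorical features would be split as if they were continuous.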