Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19516#discussion_r146531755
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
    @@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml] (
         val featureAttributes: Array[Attribute] = if (origAttrGroup.attributes.nonEmpty) {
           origAttrGroup.attributes.get.zipWithIndex.filter(x => selector.contains(x._2)).map(_._1)
         } else {
    -      Array.fill[Attribute](selector.size)(NominalAttribute.defaultAttr)
    +      null
    --- End diff --
    
    Yes, I admit it is hard to get the `values` and/or `numValues` here.
    The current Spark code throws an exception when we pipeline ChiSqSelector + DecisionTreeClassifier (on features without attributes).
    But if we remove the code that adds `Nominal` here (in ChiSqSelector), then although the pipeline ChiSqSelector + DecisionTreeClassifier (on features without attributes) will run successfully, it will produce wrong results, because DecisionTreeClassifier treats those features as continuous.
    Comparing an explicit exception with a run that succeeds on the surface but is wrong internally, I lean toward keeping the current behavior.
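
    To illustrate the trade-off, here is a minimal sketch of a check that fails loudly on a `Nominal` attribute with no `numValues`, assuming spark-mllib is on the classpath. `NominalArityCheck` and `categoricalArity` are hypothetical names, not Spark APIs, and the logic only approximates how a tree learner might read feature arity from metadata. The three `AttributeGroup` values are made-up examples.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, BinaryAttribute, NominalAttribute}

object NominalArityCheck {
  // Hypothetical helper (not a Spark API): maps feature index -> number of categories.
  // Features with no attribute, or with a non-categorical attribute, are omitted,
  // which a downstream learner would read as "continuous".
  def categoricalArity(group: AttributeGroup): Map[Int, Int] =
    group.attributes.getOrElse(Array.empty[Attribute]).zipWithIndex.collect {
      case (_: BinaryAttribute, idx) => idx -> 2
      case (nominal: NominalAttribute, idx) =>
        // A nominal attribute without a known arity cannot be used as categorical.
        idx -> nominal.getNumValues.getOrElse(
          throw new IllegalArgumentException(
            s"Feature $idx is nominal but does not specify its number of values"))
    }.toMap

  // Case 1: no per-feature attributes at all (only the vector size is known).
  val noAttributes = new AttributeGroup("features", 2)

  // Case 2: what the removed line produced, nominal attributes with unknown arity.
  val defaultNominal = new AttributeGroup(
    "features", Array.fill[Attribute](2)(NominalAttribute.defaultAttr))

  // Case 3: nominal attributes whose arity is known (assumed to be 3 here).
  val withArity = new AttributeGroup(
    "features", Array.fill[Attribute](2)(NominalAttribute.defaultAttr.withNumValues(3)))
}
```

    In this sketch, `categoricalArity(noAttributes)` returns an empty map, so every feature would silently be treated as continuous. `categoricalArity(defaultNominal)` throws, which corresponds to the explicit failure I prefer to keep. Only `withArity` gives usable categorical metadata.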

