[jira] [Commented] (SPARK-22277) Chi Square selector garbling Vector content.

Peng Meng (JIRA) Tue, 24 Oct 2017 18:50:08 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-22277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217979#comment-16217979
 ]


Peng Meng commented on SPARK-22277:
-----------------------------------

For problems 1 and 2, could you please post the test code?
For problem 1, one possible case is that all the features have the same 
chi-square statistic value; in that case, whichever feature is selected, 
the result is correct. 
The code would be helpful for analyzing the problem.
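
To make the tie case concrete, here is a minimal sketch in plain Python that computes the Pearson chi-square statistic of each feature column against the label, using the three rows from the issue description. The helper name chi_square_statistic is made up for illustration, and this is not Spark's implementation (the selector may rank features differently, for example by p-value); it only shows how several features can end up with the same statistic.

```python
from collections import Counter


def chi_square_statistic(values, labels):
    """Pearson chi-square statistic of one feature column against the labels.

    Every distinct feature value is treated as its own category.
    """
    n = len(values)
    observed = Counter(zip(values, labels))  # contingency table counts
    value_totals = Counter(values)           # row totals
    label_totals = Counter(labels)           # column totals
    stat = 0.0
    for value, value_total in value_totals.items():
        for label, label_total in label_totals.items():
            expected = value_total * label_total / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


# The three (features, clicked) rows from the issue description.
rows = [
    ([0.0, 0.0, 18.0, 1.0], 1.0),
    ([0.0, 1.0, 12.0, 0.0], 0.0),
    ([1.0, 0.0, 15.0, 0.1], 0.0),
]
labels = [label for _, label in rows]
stats = [
    chi_square_statistic([features[i] for features, _ in rows], labels)
    for i in range(4)
]
for i, stat in enumerate(stats):
    print("feature %d: chi-square statistic = %.4f" % (i, stat))
```

On this tiny dataset, features 0 and 1 tie at 0.75 and features 2 and 3 tie at 3.0, so a ranking based on the statistic alone cannot distinguish the features within each pair.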

> Chi Square selector garbling Vector content.
> --------------------------------------------
>
>                 Key: SPARK-22277
>                 URL: https://issues.apache.org/jira/browse/SPARK-22277
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Cheburakshu
>
> There is a difference in behavior between using the ChiSqSelector output 
> and using the features directly in a decision tree classifier. 
> In the code below, I have used the ChiSqSelector as a pass-through, but the 
> decision tree classifier is unable to process its output. It is, however, 
> able to process the features when they are used directly.
> The example is taken directly from the Apache Spark Python documentation.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import ChiSqSelector
> from pyspark.ml.linalg import Vectors
> import sys
> df = spark.createDataFrame([
>     (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0,),
>     (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0,),
>     (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0,)],
>     ["id", "features", "clicked"])
> # ChiSq selector will just be a pass-through. All four features in the
> # input will be in the output also.
> selector = ChiSqSelector(numTopFeatures=4, featuresCol="features",
>                      outputCol="selectedFeatures", labelCol="clicked")
> result = selector.fit(df).transform(df)
> print("ChiSqSelector output with top %d features selected" %
>       selector.getNumTopFeatures())
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> try:
>     # Fails
>     dt = DecisionTreeClassifier(labelCol="clicked",
>                                 featuresCol="selectedFeatures")
>     model = dt.fit(result)
> except Exception:
>     print(sys.exc_info())
>     # Works
>     dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="features")
>     model = dt.fit(df)
>
> # Make predictions. Using same dataset, not splitting!!
> predictions = model.transform(result)
> # Select example rows to display.
> predictions.select("prediction", "clicked", "features").show(5)
> # Select (prediction, true label) and compute test error
> evaluator = MulticlassClassificationEvaluator(
>     labelCol="clicked", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Test Error = %g " % (1.0 - accuracy))
> {code}
> Output:
> ChiSqSelector output with top 4 features selected
> (<class 'pyspark.sql.utils.IllegalArgumentException'>, 
> IllegalArgumentException('Feature 0 is marked as Nominal (categorical), but 
> it does not have the number of values specified.', 
> 'org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:69)\n\t
>  at org.apache.spark.ml.util.MetadataUtils$$anonfun$getCategoricalFeatures$1.apply(MetadataUtils.scala:59)\n\t
>  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)\n\t
>  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)\n\t
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)\n\t
>  at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)\n\t
>  at org.apache.spark.ml.util.MetadataUtils$.getCategoricalFeatures(MetadataUtils.scala:59)\n\t
>  at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:101)\n\t
>  at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:96)\n\t
>  at org.apache.spark.ml.Predictor.fit(Predictor.scala:72)\n\t
>  at sun.reflect.GeneratedMethodAccessor280.invoke(Unknown Source)\n\t
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
>  at java.lang.reflect.Method.invoke(Method.java:498)\n\t
>  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t
>  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t
>  at py4j.Gateway.invoke(Gateway.java:280)\n\t
>  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t
>  at py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t
>  at py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t
>  at java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x0A87D878>)
> +----------+-------+------------------+
> |prediction|clicked|          features|
> +----------+-------+------------------+
> |       1.0|    1.0|[0.0,0.0,18.0,1.0]|
> |       0.0|    0.0|[0.0,1.0,12.0,0.0]|
> |       0.0|    0.0|[1.0,0.0,15.0,0.1]|
> +----------+-------+------------------+
> Test Error = 0 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
