[
https://issues.apache.org/jira/browse/SPARK-22295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209164#comment-16209164
]
Peng Meng commented on SPARK-22295:
-----------------------------------
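Please use labelCol=" class", not labelCol="class". The column name read from the CSV header has a leading space (note the " class" header in the df.show() output below), so "class" does not match any field in the data frame.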
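As a rough illustration of why the lookup fails, here is a minimal sketch using only the Python standard library rather than Spark. The sample text and the helper names read_header and strip_column_names are hypothetical and only mirror the header style visible in the report; the point is that a space after each comma becomes part of the following column name unless it is stripped.

```python
import csv
import io

# Hypothetical sample mirroring the header style in the report:
# a space follows each comma, so it becomes part of the next column name.
SAMPLE = "preg, plas, pres, class\n6, 148, 72, 1\n"


def read_header(text):
    """Return the raw column names from the first CSV line, whitespace intact."""
    return next(csv.reader(io.StringIO(text)))


def strip_column_names(names):
    """Return the names with leading and trailing whitespace removed."""
    return [name.strip() for name in names]


raw = read_header(SAMPLE)
clean = strip_column_names(raw)

print(raw)    # raw names keep the leading space
print(clean)  # stripped names
print("class" in raw, "class" in clean)
```

The last line prints False True: "class" is only found once the names are stripped. In PySpark the same renaming can probably be applied before assembling features with df = df.toDF(*[c.strip() for c in df.columns]), after which the unpadded names (including labelCol="class") should match. This is a suggestion, not something verified against the reporter's file.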
> Chi Square selector not recognizing field in Data frame
> -------------------------------------------------------
>
> Key: SPARK-22295
> URL: https://issues.apache.org/jira/browse/SPARK-22295
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.1.1
> Reporter: Cheburakshu
>
> ChiSquare selector does not recognize the field 'class', which is present in
> the data frame, while fitting the model. I am using the Pima Indians diabetes
> dataset from UCI. The complete code and output are below for reference.
> However, when a few rows of the input file are used to create a data frame
> manually, it works. I couldn't understand the pattern here.
> Kindly help.
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin',
>     ' test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
>     css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>         outputCol="selected", labelCol='class').fit(df)
> except:
>     print(sys.exc_info())
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+------+
> |preg| plas| pres| skin| test| mass| pedi| age| class|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> | 6| 148| 72| 35| 0| 33.6|0.627| 50| 1|
> +----+-----+-----+-----+-----+-----+-----+----+------+
> only showing top 1 row
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age| class| features|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> | 6| 148| 72| 35| 0| 33.6|0.627| 50| 1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+------+--------------------+
> only showing top 1 row
> (<class 'pyspark.sql.utils.IllegalArgumentException'>,
> IllegalArgumentException('Field "class" does not exist.',
> 'org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
> at
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)\n\t
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)\n\t at
> scala.collection.AbstractMap.getOrElse(Map.scala:59)\n\t at
> org.apache.spark.sql.types.StructType.apply(StructType.scala:263)\n\t at
> org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)\n\t
> at
> org.apache.spark.ml.feature.ChiSqSelector.transformSchema(ChiSqSelector.scala:183)\n\t
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)\n\t
> at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:159)\n\t
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\t at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\t
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\t
> at java.lang.reflect.Method.invoke(Method.java:498)\n\t at
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\t at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\t at
> py4j.Gateway.invoke(Gateway.java:280)\n\t at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\t at
> py4j.commands.CallCommand.execute(CallCommand.java:79)\n\t at
> py4j.GatewayConnection.run(GatewayConnection.java:214)\n\t at
> java.lang.Thread.run(Thread.java:745)'), <traceback object at 0x0B743BC0>)
> *The code below works fine:*
> {code:python}
> from pyspark.ml.feature import VectorAssembler, ChiSqSelector
> import sys
> file_name='data/pima-indians-diabetes.data'
> #df=spark.read.format("csv").option("inferSchema","true").option("header","true").load(file_name).cache()
> # Just pasted a few rows from the input file and created a data frame. This
> # works, but the frame loaded from the file does not.
> df = spark.createDataFrame([
> [6,148,72,35,0,33.6,0.627,50,1],
> [1,85,66,29,0,26.6,0.351,31,0],
> [8,183,64,0,0,23.3,0.672,32,1],
> ], ['preg', ' plas', ' pres', ' skin', ' test', ' mass', ' pedi', ' age',
> "class"])
> df.show(1)
> assembler = VectorAssembler(inputCols=['preg', ' plas', ' pres', ' skin',
>     ' test', ' mass', ' pedi', ' age'],outputCol="features")
> df=assembler.transform(df)
> df.show(1)
> try:
>     css=ChiSqSelector(numTopFeatures=5, featuresCol="features",
>         outputCol="selected", labelCol="class").fit(df)
> except:
>     print(sys.exc_info())
> print(css.selectedFeatures)
> {code}
> Output:
> +----+-----+-----+-----+-----+-----+-----+----+-----+
> |preg| plas| pres| skin| test| mass| pedi| age|class|
> +----+-----+-----+-----+-----+-----+-----+----+-----+
> | 6| 148| 72| 35| 0| 33.6|0.627| 50| 1|
> +----+-----+-----+-----+-----+-----+-----+----+-----+
> only showing top 1 row
> +----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
> |preg| plas| pres| skin| test| mass| pedi| age|class| features|
> +----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
> | 6| 148| 72| 35| 0| 33.6|0.627| 50| 1|[6.0,148.0,72.0,3...|
> +----+-----+-----+-----+-----+-----+-----+----+-----+--------------------+
> only showing top 1 row
> [0, 1, 2, 3, 5]