Xiangrui Meng created SPARK-16171:
-------------------------------------
Summary: Filter UDFs in StringIndexer shouldn't throw exception
Key: SPARK-16171
URL: https://issues.apache.org/jira/browse/SPARK-16171
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
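[~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with
additional filters. It seems that during filter pushdown, the optimizer changes
the ordering of the filters in the logical plan, so the indexing UDF is
evaluated on rows that handleInvalid = "skip" was supposed to drop first. I'm
not sure whether we should treat this as a bug.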
{code}
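// In spark-shell or a notebook, `spark` is predefined; the implicits provide toDF and 'idx.
import org.apache.spark.ml.feature.StringIndexer
import spark.implicits._

// Fit on the labels "0" to "2" only.
val df1 = (0 until 3).map(_.toString).toDF
val indexer = new StringIndexer()
  .setInputCol("value")
  .setOutputCol("idx")
  .setHandleInvalid("skip")
  .fit(df1)

// Transform data that also contains the unseen labels "3" and "4".
val df2 = (0 until 5).map(_.toString).toDF
val predictions = indexer.transform(df2)
predictions.show() // this is okay
predictions.where('idx > 2).show() // this will throw an exception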
{code}
Please see the notebook at
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html
for error messages.
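
To illustrate why the filter ordering matters, here is a minimal plain-Scala sketch with no Spark involved. It is only a model of the suspected behavior, not the actual StringIndexer or optimizer code; the object name, the method names, and the exception type are all hypothetical stand-ins.

```scala
// Plain-Scala model of the suspected failure. All names here are hypothetical.
object FilterOrderSketch {
  // Labels seen while fitting, mapped to their indices.
  val labelToIndex: Map[String, Double] = Map("0" -> 0.0, "1" -> 1.0, "2" -> 2.0)

  // Stand-in for the indexing UDF: throws on a label it has never seen.
  def index(label: String): Double =
    labelToIndex.getOrElse(
      label,
      throw new NoSuchElementException(s"Unseen label: $label"))

  // Same input as df2 in the report: "0" through "4".
  val input: Seq[String] = (0 until 5).map(_.toString)

  // Intended order: drop unseen labels, then index, then apply the user filter.
  def skipThenFilter(threshold: Double): Seq[Double] =
    input.filter(labelToIndex.contains).map(index).filter(_ > threshold)

  // Reordered: the user predicate, which calls index, runs before the skip filter,
  // so index is evaluated on the unseen labels and throws.
  def filterThenSkip(threshold: Double): Seq[String] =
    input.filter(label => index(label) > threshold).filter(labelToIndex.contains)
}
```

Under this model, skipThenFilter never calls index on an unseen label, while filterThenSkip fails as soon as it reaches "3". If the plan is reordered in a similar way, a UDF that throws on unseen labels would fail even though the skip filter is still present.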