[ https://issues.apache.org/jira/browse/SPARK-16171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng deleted SPARK-16171:
----------------------------------

> Filter UDFs in StringIndexer shouldn't throw exception
> ------------------------------------------------------
>
>                 Key: SPARK-16171
>                 URL: https://issues.apache.org/jira/browse/SPARK-16171
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Xiangrui Meng
>
> [~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with
> additional filters. It seems that during filter pushdown, we changed the
> ordering in the logical plan. Keeping the ordering of filters is not an SQL
> contract. So we should probably update StringIndexer implementation to make
> the filter UDF output null if the value is out of range.
> {code}
> val df1 = (0 until 3).map(_.toString).toDF
> val indexer = new StringIndexer()
>   .setInputCol("value")
>   .setOutputCol("idx")
>   .setHandleInvalid("skip")
>   .fit(df1)
> val df2 = (0 until 5).map(_.toString).toDF
> val predictions = indexer.transform(df2)
> predictions.show() // this is okay
> predictions.where('idx > 2).show() // this will throw an exception
> {code}
> Please see the notebook at
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html
> for error messages.
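
The ordering problem described in the report can be illustrated without Spark. The following is a minimal plain-Scala sketch, not the StringIndexer implementation: the object name, `strictIndex`, `lenientIndex`, and the label-to-index map are all hypothetical, indices are assigned by value rather than by frequency, and the predicate uses `> 1` instead of `> 2` so that the result is non-empty. It shows why a lookup that throws depends on the "skip" filter running first, while a lookup that yields an absent value (standing in for SQL null) gives the same result in any order.

```scala
import scala.util.Try

// Plain-Scala sketch with no Spark dependency; every name here is hypothetical.
object FilterOrderSketch {
  // Labels seen at fit time, mapped to indices (simplified: ordered by value).
  val labelToIndex: Map[String, Double] = Map("0" -> 0.0, "1" -> 1.0, "2" -> 2.0)

  // Strict lookup: throws on an unseen label, so it is only safe when unseen
  // labels have already been filtered out.
  def strictIndex(label: String): Double =
    labelToIndex.getOrElse(label, throw new NoSuchElementException(s"Unseen label: $label"))

  // Lenient lookup: an unseen label becomes None (standing in for SQL null).
  def lenientIndex(label: String): Option[Double] = labelToIndex.get(label)

  // Same shape as df2 in the report: labels "0" through "4".
  val input: Seq[String] = (0 until 5).map(_.toString)

  // Intended order: drop unseen labels, then index, then filter on the index.
  def intendedOrder: Seq[Double] =
    input.filter(labelToIndex.contains).map(strictIndex).filter(_ > 1)

  // Reordered: the index predicate runs on every row, including unseen labels,
  // so the strict lookup throws before the "skip" filter has a chance to run.
  def reorderedOrder: Try[Seq[Double]] =
    Try(input.filter(label => strictIndex(label) > 1))

  // Order-independent: unseen labels yield None and never satisfy the predicate.
  def orderIndependent: Seq[Double] =
    input.flatMap(label => lenientIndex(label).filter(_ > 1))

  def main(args: Array[String]): Unit = {
    println(s"intended order:    $intendedOrder")
    println(s"reordered fails:   ${reorderedOrder.isFailure}")
    println(s"order independent: $orderIndependent")
  }
}
```

Under these assumptions, the lenient variant returns the same rows as the intended order without relying on any particular filter ordering, which is the property the report asks the filter UDF to have.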