[ 
https://issues.apache.org/jira/browse/SPARK-16171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16171:
----------------------------------
    Description: 
[~cmccubbin] reported a bug when using StringIndexer in an ML pipeline with 
additional filters. It seems that filter pushdown changes the order of the 
filters in the logical plan. Preserving the order of filters is not part of the 
SQL contract, so we should probably update the StringIndexer implementation so 
that the filter UDF outputs null, instead of throwing an exception, if the value 
is out of range.

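{code}
import org.apache.spark.ml.feature.StringIndexer
// Needed for toDF and 'idx outside spark-shell; assumes a SparkSession named spark.
import spark.implicits._

// Fit on the labels "0", "1", "2" only.
val df1 = (0 until 3).map(_.toString).toDF
val indexer = new StringIndexer()
  .setInputCol("value")
  .setOutputCol("idx")
  .setHandleInvalid("skip")
  .fit(df1)
// Transform data that also contains the unseen labels "3" and "4".
val df2 = (0 until 5).map(_.toString).toDF
val predictions = indexer.transform(df2)
predictions.show() // this is okay
predictions.where('idx > 2).show() // this will throw an exception
{code}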

Please see the notebook at 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html
 for error messages.
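
The idea of returning null for out-of-range values can be illustrated without Spark. The following is only a minimal plain-Scala sketch, not the StringIndexer implementation: the object and method names are hypothetical, `Option` stands in for a nullable SQL value, and the two filter orders model what may happen when the optimizer reorders filters.

```scala
// Plain-Scala model of the proposal; no Spark dependency.
// All names here are hypothetical and chosen for illustration only.
object NullSafeIndexerSketch {
  // Labels seen during fitting ("0", "1", "2"), mapped to their indices.
  val labelToIndex: Map[String, Double] =
    Seq("0", "1", "2").zipWithIndex.map { case (l, i) => l -> i.toDouble }.toMap

  // Throwing variant: only safe if unseen labels were filtered out beforehand.
  def indexOrThrow(label: String): Double =
    labelToIndex.getOrElse(
      label, throw new NoSuchElementException(s"Unseen label: $label"))

  // Null-style variant: None plays the role of SQL NULL for unseen labels.
  def indexOrNone(label: String): Option[Double] = labelToIndex.get(label)

  // Intended order: drop unseen labels first, then apply the user's predicate.
  def skipThenFilter(labels: Seq[String], p: Double => Boolean): Seq[String] =
    labels.filter(labelToIndex.contains).filter(l => p(indexOrThrow(l)))

  // Reordered: the user's predicate runs first. With the null-style variant an
  // unseen label simply fails the predicate, so nothing throws.
  def filterThenSkip(labels: Seq[String], p: Double => Boolean): Seq[String] =
    labels.filter(l => indexOrNone(l).exists(p)).filter(labelToIndex.contains)

  def main(args: Array[String]): Unit = {
    val labels = (0 until 5).map(_.toString)
    println(skipThenFilter(labels, _ > 1))
    println(filterThenSkip(labels, _ > 1))
  }
}
```

In this model both orders select the same rows, whereas applying `indexOrThrow` before the skip filter would fail on the labels "3" and "4".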

  was:
[~cmccubbin] reported a bug when he used StringIndexer in an ML pipeline with 
additional filters. It seems that during filter pushdown, we changed the 
ordering in the logical plan. I'm not sure whether we should treat this as a 
bug.

{code}
val df1 = (0 until 3).map(_.toString).toDF
val indexer = new StringIndexer()
  .setInputCol("value")
  .setOutputCol("idx")
  .setHandleInvalid("skip")
  .fit(df1)
val df2 = (0 until 5).map(_.toString).toDF
val predictions = indexer.transform(df2)
predictions.show() // this is okay
predictions.where('idx > 2).show() // this will throw an exception
{code}

Please see the notebook at 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/2159162931615821/588180/latest.html
 for error messages.


> Filter UDFs in StringIndexer shouldn't throw exception
> ------------------------------------------------------
>
>                 Key: SPARK-16171
>                 URL: https://issues.apache.org/jira/browse/SPARK-16171
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
