[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370046#comment-16370046 ]

Marco Gaido commented on SPARK-23463:
-------------------------------------

Hi [~m.bakshi11]. The problem is simple. The column `val` in `df` is a 
string. When you filter it against an integer, the column is cast to an 
integer for the comparison. The problem here is that you think you are 
dealing with doubles, while you actually have strings. You should fix your 
code to parse the column `val` as a double, and you won't have this issue 
anymore. Thanks.
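
A minimal sketch of that fix, assuming PySpark and the `df`/`val` names from the report (the session setup and sample data below are illustrative, not from the original report; the cast to double is the suggested change):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Sample data mirroring the report: `val` is a *string* column with a blank entry.
    df = spark.createDataFrame(
        [("ALL", "0.01"), ("ALL", "0.02"), ("ALL", "0.004"),
         ("ALL", ""), ("ALL", "2.5"), ("ALL", "4.5"), ("ALL", "45")],
        ["dev", "val"])

    # Parse `val` as a double before filtering. The blank entry becomes null and is
    # dropped by the comparison, and the numeric filter then behaves as expected.
    df_fixed = df.withColumn("val", F.col("val").cast("double"))
    df_fixed.filter(df_fixed["val"] > 0).show()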

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23463
>                 URL: https://issues.apache.org/jira/browse/SPARK-23463
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.1
>            Reporter: Manan Bakshi
>            Priority: Critical
>         Attachments: sample
>
>
> Filter operations were updated in Spark 2.2.0: the Cost-Based Optimizer was 
> introduced to look at table stats and decide filter selectivity. However, 
> since then, filter has behaved unexpectedly for blank values. The operation 
> would not only drop rows with blank values but also filter out rows that 
> actually meet the filter criteria.
> Steps to repro
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over the val column in this dataframe yields 
> unexpected results. For example, the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if the 0 in the filter 
> condition is replaced by the float 0.0:
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.
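
For illustration, assuming the string-typed dataframe `df` built as in the sketch above, the implicit cast described in the comment can be observed by comparing the two query plans (exact plan output is version-dependent):

    # Integer literal: the string column is cast for the comparison, which is
    # what makes rows such as "0.01" disappear in the affected versions.
    df.filter(df["val"] > 0).explain()

    # Float literal: the comparison is done on doubles and the expected rows survive.
    df.filter(df["val"] > 0.0).explain()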



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
