[ 
https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211239#comment-16211239
 ] 

Marco Gaido commented on SPARK-22307:
-------------------------------------

Have you checked whether `col1` evaluates to null for the missing records? If 
so, there is no bug: this is the expected behavior under the SQL standard. A 
comparison involving null evaluates to null, NOT null is still null, and a 
filter keeps only the rows for which the condition is true. Rows where `col1` 
is null are therefore correctly dropped by both `filter(col1)` and 
`filter(not(col1))`.
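
To make the null handling concrete, here is a minimal sketch against a tiny 
in-memory table. The column names `id` and `parentId`, the sample rows, and the 
local SparkSession are made up for illustration and are not taken from 
Catalog.json. It assumes a Spark 2.x or later build on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit, not, when}

// Local session for illustration only (assumption: not the reporter's setup).
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("null-filter-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical two-level hierarchy: parentId is null for top-level records.
val df = Seq[(String, Option[String])](
  ("x", None),      // true  OR null  => true
  ("a", Some("x")), // false OR true  => true
  ("b", Some("p")), // false OR false => false
  ("c", None),      // false OR null  => null
  ("d", None)       // false OR null  => null
).toDF("id", "parentId")

val cond = $"id".isin("x") || $"parentId".isin("x")

// filter keeps only rows where the condition is true.
val matched = df.filter(cond).count                             // 2
// NOT null is null, so the two null rows are dropped here as well.
val negated = df.filter(not(cond)).count                        // 1
// Rows for which the condition is neither true nor false.
val undefined = df.filter(cond.isNull).count                    // 2
// The workaround from the report maps null to true via otherwise(true).
val workaround = df.filter(when(cond, false).otherwise(true)).count // 3
// Equivalent: treat an undefined condition as false before negating.
val nullAsFalse = df.filter(not(coalesce(cond, lit(false)))).count  // 3
```

The three groups add up to the table size (2 + 1 + 2 = 5). In the report the 
same accounting would give 2 + 4 + 617 = 623, which is consistent with 617 
records having a null condition, although that has not been confirmed against 
the attached data.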

> NOT condition working incorrectly
> ---------------------------------
>
>                 Key: SPARK-22307
>                 URL: https://issues.apache.org/jira/browse/SPARK-22307
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: Andrey Yakovenko
>         Attachments: Catalog.json.gz
>
>
> Suggested test case: a table with x records, filtered by expression expr, 
> returns y records (y < x), but not(expr) does not return x-y records. 
> Workaround: when(expr, false).otherwise(true) works fine.
> {code}
> val ctg = spark.sqlContext.read.json("/user/Catalog.json")
> scala> ctg.printSchema
> root
>  |-- Id: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- Parent: struct (nullable = true)
>  |    |-- Id: string (nullable = true)
>  |    |-- Name: string (nullable = true)
>  |    |-- Parent: struct (nullable = true)
>  |    |    |-- Id: string (nullable = true)
>  |    |    |-- Name: string (nullable = true)
>  |    |    |-- Parent: struct (nullable = true)
>  |    |    |    |-- Id: string (nullable = true)
>  |    |    |    |-- Name: string (nullable = true)
>  |    |    |    |-- Parent: string (nullable = true)
>  |    |    |    |-- SKU: string (nullable = true)
>  |    |    |-- SKU: string (nullable = true)
>  |    |-- SKU: string (nullable = true)
>  |-- SKU: string (nullable = true)
> val col1 = expr("((((Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN 
> ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', 
> '13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))")
> col1: org.apache.spark.sql.Column = ((((Id IN (13MXIIAA4, 13MXIBAA4)) OR 
> (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, 
> 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4)))
> scala> ctg.count
> res5: Long = 623
> scala> ctg.filter(col1).count
> res2: Long = 2
> scala> ctg.filter(not(col1)).count
> res3: Long = 4
> scala> ctg.filter(when(col1, false).otherwise(true)).count
> res4: Long = 621
> {code}
> The table is a hierarchy-like structure whose records have different numbers 
> of levels filled in. I suspect that, because the hierarchy is only partly 
> filled, the condition sometimes returns null/undefined/failed/NaN (neither 
> true nor false).



