Github user mbautin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9308#discussion_r45291382
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
    @@ -1870,4 +1870,19 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
           assert(sampled.count() == sampledOdd.count() + sampledEven.count())
         }
       }
    +
    +  test("SPARK-10707: nullability should be correctly propagated through set operations") {
    +    withTempTable("src") {
    +      Seq((1, 1)).toDF("key", "value").registerTempTable("src")
    +      checkAnswer(
    +        sql("""SELECT count(v) FROM (
    +              |  SELECT v FROM (
    +              |    SELECT 'foo' AS v FROM src UNION ALL
    +              |    SELECT NULL AS v FROM src
    +              |  ) my_union WHERE isnull(v)
    +              |) my_subview""".stripMargin),
    +        Seq(Row(0)))
    --- End diff --
    
    @cloud-fan:
    
    My original test case incorrectly produced 1 instead of 0 because of this
    behavior of the `NullPropagation` rule:
    
    ```
    === Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation ===
    Before:
    !Aggregate [COUNT(v#12) AS _c0#14L]
      Union
       Project [foo AS v#12]
        Filter HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNull(foo)
         MetastoreRelation default, dual, None
       Project [CAST(null AS v#13, StringType) AS v#17]
        Filter HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNull(CAST(null, StringType))
         MetastoreRelation default, dual, None

    After:
     Aggregate [COUNT(1) AS _c0#14L]
      Union
       Project [foo AS v#12]
        Filter HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNull(foo)
         MetastoreRelation default, dual, None
       Project [CAST(null AS v#13, StringType) AS v#17]
        Filter HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNull(CAST(null, StringType))
         MetastoreRelation default, dual, None
    ```
    (see https://gist.githubusercontent.com/mbautin/c916a2a7ce733d039137/raw
    for the complete log).
    
    `COUNT(v)` got replaced with `COUNT(1)` by the following code from
    `NullPropagation` because the output column of UNION was incorrectly
    considered non-nullable:
    
    ```scala
          case e @ AggregateExpression(Count(expr), mode, false) if !expr.nullable =>
            // This rule should be only triggered when isDistinct field is false.
            AggregateExpression(Count(Literal(1)), mode, isDistinct = false)
    ```
    
    That is why the query returned 1 instead of 0.
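    
    To make the mechanism concrete, here is a minimal sketch in plain Scala.
    It is a toy model, not Catalyst code: `Column`, `UnionNullability`, and
    its methods are hypothetical names made up for this illustration. It
    assumes that a UNION output column should be treated as nullable when
    the column is nullable in either input, and it shows why rewriting
    `COUNT(v)` to `COUNT(1)` is only safe for a non-nullable `v`.
    
    ```scala
    // Toy model only: none of these names come from Spark.
    final case class Column(name: String, nullable: Boolean)

    object UnionNullability {
      // Assumed rule: a UNION ALL output column may hold NULL if the
      // corresponding column is nullable in either input.
      def unionOutput(left: Seq[Column], right: Seq[Column]): Seq[Column] = {
        require(left.length == right.length, "UNION inputs must have the same arity")
        left.zip(right).map { case (l, r) => Column(l.name, l.nullable || r.nullable) }
      }

      // COUNT(expr) skips NULL values; None stands in for NULL here.
      def countExpr(rows: Seq[Option[String]]): Long = rows.count(_.isDefined).toLong

      // COUNT(1) counts every row, NULL or not.
      def countOne(rows: Seq[Option[String]]): Long = rows.length.toLong

      // The COUNT(expr) -> COUNT(1) rewrite preserves results only when
      // expr can never be NULL.
      def canRewriteToCountOne(col: Column): Boolean = !col.nullable
    }
    ```
    
    With `'foo' AS v` modeled as non-nullable and `NULL AS v` as nullable,
    `unionOutput` marks `v` as nullable, so `canRewriteToCountOne` is false.
    For the single row that survives `WHERE isnull(v)`, i.e. `Seq(None)`,
    `countExpr` gives 0 while `countOne` gives 1, which matches the wrong
    answer described above.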
    
    I'll check if your test also reproduces the bug. I think it would be
    great to keep both tests just in case.

