Github user mbautin commented on a diff in the pull request:
https://github.com/apache/spark/pull/9308#discussion_r45291382
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
    @@ -1870,4 +1870,19 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
           assert(sampled.count() == sampledOdd.count() + sampledEven.count())
         }
       }
    +
    +  test("SPARK-10707: nullability should be correctly propagated through set operations") {
    +    withTempTable("src") {
    +      Seq((1, 1)).toDF("key", "value").registerTempTable("src")
    +      checkAnswer(
    +        sql("""SELECT count(v) FROM (
    +              |  SELECT v FROM (
    +              |    SELECT 'foo' AS v FROM src UNION ALL
    +              |    SELECT NULL AS v FROM src
    +              |  ) my_union WHERE isnull(v)
    +              |) my_subview""".stripMargin),
    +        Seq(Row(0)))
--- End diff --
@cloud-fan:
My original test case incorrectly produced 1 instead of 0 because of this
behavior of the `NullPropagation` rule:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.NullPropagation ===
!Aggregate [COUNT(v#12) AS _c0#14L]                                                                        Aggregate [COUNT(1) AS _c0#14L]
  Union                                                                                                     Union
   Project [foo AS v#12]                                                                                     Project [foo AS v#12]
    Filter HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNull(foo)                         Filter HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNull(foo)
     MetastoreRelation default, dual, None                                                                     MetastoreRelation default, dual, None
   Project [CAST(null AS v#13, StringType) AS v#17]                                                          Project [CAST(null AS v#13, StringType) AS v#17]
    Filter HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNull(CAST(null, StringType))      Filter HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNull(CAST(null, StringType))
     MetastoreRelation default, dual, None                                                                     MetastoreRelation default, dual, None
```
(see https://gist.githubusercontent.com/mbautin/c916a2a7ce733d039137/raw
for the complete log).
`COUNT(v)` was replaced with `COUNT(1)` by the following code in
`NullPropagation` because the output column of the `UNION` was incorrectly
considered non-nullable:
```scala
case e @ AggregateExpression(Count(expr), mode, false) if !expr.nullable =>
  // This rule should be only triggered when isDistinct field is false.
  AggregateExpression(Count(Literal(1)), mode, isDistinct = false)
```
That is why the query returned 1 instead of 0.
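To make the reasoning concrete, here is a minimal plain-Scala sketch (no Spark involved) of why the rewrite is only safe when the counted column can never be null. The names `Column`, `unionOutput`, `countExpr`, `countOne`, and `canRewriteToCountOne` are hypothetical stand-ins for illustration, not Catalyst APIs; the sketch assumes that a union's output column is nullable whenever it is nullable in either child.

```scala
object CountRewriteSketch {
  // Hypothetical stand-in for a Catalyst attribute: a column name plus a nullability flag.
  final case class Column(name: String, nullable: Boolean)

  // Assumption: a union's output column can be null if it is nullable in either child.
  def unionOutput(left: Column, right: Column): Column =
    Column(left.name, left.nullable || right.nullable)

  // Models COUNT(v): counts only the rows where v is not null.
  def countExpr(rows: Seq[Option[String]]): Long = rows.count(_.isDefined).toLong

  // Models COUNT(1): counts every row.
  def countOne(rows: Seq[Option[String]]): Long = rows.size.toLong

  // The guard from the rule: COUNT(v) may become COUNT(1) only if v can never be null.
  def canRewriteToCountOne(v: Column): Boolean = !v.nullable

  def main(args: Array[String]): Unit = {
    val fooBranch  = Column("v", nullable = false) // SELECT 'foo' AS v
    val nullBranch = Column("v", nullable = true)  // SELECT NULL AS v
    val v = unionOutput(fooBranch, nullBranch)

    // Rows of the union that survive WHERE isnull(v): only the NULL row.
    val filtered: Seq[Option[String]] = Seq(Some("foo"), None).filter(_.isEmpty)

    println(s"rewrite allowed: ${canRewriteToCountOne(v)}")
    println(s"COUNT(v) = ${countExpr(filtered)}, COUNT(1) = ${countOne(filtered)}")
  }
}
```

With nullability merged across both branches, the guard rejects the rewrite. If the union column were taken as non-nullable from the first branch alone, the guard would pass and the single surviving NULL row would be counted, giving 1 where `COUNT(v)` gives 0.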
I'll check if your test also reproduces the bug. I think it would be great
to keep both tests just in case.