GitHub user liancheng commented on the pull request:
https://github.com/apache/spark/pull/3676#issuecomment-66719303
This LGTM, but I would like to share some findings about the semantics of
`COUNT(expr)`. It seems that Hive has a bug here, and Spark SQL behaves
differently from Hive.
The Hive language manual says [1]:
> count(expr) - Returns the number of rows for which the supplied
expression is non-NULL
but this doesn't conform to the following results (tested under Hive
0.13.1):
```sql
-- The test table `src1(key INT, value STRING)` is the one we used in Spark
-- SQL `TestHiveContext`.
-- The table consists of 25 rows, among which 10 `key`s are `NULL`.
CREATE TABLE src1(key INT, value STRING);
LOAD DATA LOCAL INPATH 'data/files/kv3.txt' INTO TABLE src1;
SELECT COUNT(key) FROM src1
WHERE key IS NOT NULL; -- => 15, reasonable
SELECT COUNT(NULL) FROM src1; -- => 0, reasonable
SELECT COUNT(1) FROM src1; -- => 25, reasonable, 1 is never `NULL`
SELECT COUNT(key + 1) FROM src1; -- => 15, reasonable since `NULL + 1` is `NULL`
SELECT COUNT(key) FROM src1; -- => 25, huh?
CREATE TABLE tmp AS
SELECT CAST(key AS STRING) AS key, value
FROM src1;
SELECT COUNT(key) FROM tmp; -- => 15, hm...
```
I'm not sure whether Hive has an equivalent of the `StructField.nullable`
field in Spark SQL, but it seems to always treat `INT` columns as
non-nullable, even when the underlying data contains `NULL`. `COUNT(expr)`
then apparently skips the null check on the actual data when `expr` is a
single column whose data type is considered non-nullable.
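To make the hypothesis concrete, here is a minimal plain-Scala sketch of the two behaviors. It is a toy model only: `Column`, `countDocumented`, and `countTrustingSchema` are hypothetical names made up for illustration, and nothing here is taken from Hive's actual implementation.

```scala
// Toy model only: these names are hypothetical and are not Hive or Spark APIs.
object CountSemantics {
  // A column carries its declared nullability plus the actual values (None = NULL).
  final case class Column(declaredNullable: Boolean, values: Seq[Option[Int]])

  // Documented semantics: count the rows whose value is non-NULL.
  def countDocumented(col: Column): Long =
    col.values.count(_.isDefined).toLong

  // Hypothesized shortcut: trust the declared type and skip the per-row null check.
  def countTrustingSchema(col: Column): Long =
    if (col.declaredNullable) countDocumented(col) else col.values.size.toLong

  def main(args: Array[String]): Unit = {
    // 25 rows, 10 of them NULL, mirroring the shape of `src1.key`.
    val values = Seq.fill(15)(Option(1)) ++ Seq.fill(10)(Option.empty[Int])
    val key = Column(declaredNullable = false, values)

    println(countDocumented(key))                                   // 15
    println(countTrustingSchema(key))                               // 25
    println(countTrustingSchema(key.copy(declaredNullable = true))) // 15
  }
}
```

If the guess is right, the 25 reported for `COUNT(key)` would correspond to the `countTrustingSchema` path with a column wrongly declared non-nullable.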
On the other hand, Spark SQL returns the expected result. Here is a sample
`hive/console` session:
```scala
scala> sql("SELECT COUNT(key) FROM src1").collect()
...
res2: Array[org.apache.spark.sql.Row] = Array([15]) // <- Reasonable
scala> table("src1").printSchema
root
|-- key: integer (nullable = true)
|-- value: string (nullable = true)
```
Notice that we consider all fields read from Hive Metastore nullable, since
data files can be loaded into a table without any validation.
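As a rough illustration of that rule, the sketch below maps metastore columns to schema fields and always sets `nullable = true`. `MetastoreColumn`, `Field`, `toField`, and `render` are hypothetical stand-ins written for this example; they are not the actual Spark SQL or Hive Metastore classes.

```scala
// Toy model only: hypothetical names, not the real Spark SQL or Hive Metastore classes.
object MetastoreSchema {
  // What a metastore might report for a column: a name and a type string.
  final case class MetastoreColumn(name: String, hiveType: String)

  // A simplified stand-in for Spark SQL's StructField.
  final case class Field(name: String, dataType: String, nullable: Boolean)

  // Always mark the field nullable, because loaded data is never validated.
  def toField(col: MetastoreColumn): Field =
    Field(col.name, col.hiveType, nullable = true)

  // Render the fields in a layout similar to `printSchema`.
  def render(fields: Seq[Field]): String =
    ("root" +: fields.map(f => s" |-- ${f.name}: ${f.dataType} (nullable = ${f.nullable})"))
      .mkString("\n")

  def main(args: Array[String]): Unit = {
    val columns = Seq(MetastoreColumn("key", "int"), MetastoreColumn("value", "string"))
    println(render(columns.map(toField)))
  }
}
```

With every field nullable, a `COUNT(key)` cannot skip the per-row null check on the strength of the schema alone.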
[1]: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF