GitHub user liancheng commented on the pull request:
https://github.com/apache/spark/pull/862#issuecomment-44191684
Hi @ueshin, could you please point me to a source for the semantics of the
`avg` function you described? I checked the official Hive documentation and
code base, and also ran several experiments with Hive 0.12.0, but found no
indication that `avg` should count null values.
As a quick check, here is a sample session I ran under Hive 0.12.0:
```
hive> create table src(key int, value string);
hive> load data local inpath '/tmp/kv3.txt' into table src;
hive> select avg(key) from src;
...
OK
237.06666666666666
hive> select avg(key) from src where key is not null;
...
OK
237.06666666666666
```
`kv3.txt` is copied from the Hive test data files and contains 15 non-null
keys and 10 null keys:
```
hive> select key, value from src;
hive> select * from src;
...
OK
238 val_238
NULL
311 val_311
NULL val_27
NULL val_165
NULL val_409
255 val_255
278 val_278
98 val_98
NULL val_484
NULL val_265
NULL val_193
401 val_401
150 val_150
273 val_273
224
369
66 val_66
128
213 val_213
146 val_146
406 val_406
NULL
NULL
NULL
```
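The arithmetic behind the two results can be cross-checked outside Hive. The following is a minimal Python sketch, not part of the Hive session: the key list is transcribed by hand from the output above, `None` stands in for NULL, and the variable names are only illustrative. It compares the mean taken over the 15 non-null keys with the value `avg` would have to return if the 10 null rows were counted in the denominator.

```python
# Keys transcribed from the `select * from src` output above, in row order.
# None stands in for a NULL key.
keys = [
    238, None, 311, None, None, None, 255, 278, 98, None,
    None, None, 401, 150, 273, 224, 369, 66, 128, 213,
    146, 406, None, None, None,
]

non_null = [k for k in keys if k is not None]

# Mean over non-null keys only (denominator 15).
avg_ignoring_nulls = sum(non_null) / len(non_null)

# Hypothetical semantics where null rows count toward the denominator (25).
avg_counting_nulls = sum(non_null) / len(keys)

print("non-null keys:", len(non_null), "null keys:", len(keys) - len(non_null))
print("avg ignoring nulls:", avg_ignoring_nulls)
print("avg counting nulls:", avg_counting_nulls)
```

The non-null keys sum to 3556, so ignoring nulls gives 3556 / 15, which agrees with the 237.0666... reported by both queries, whereas counting the null rows would give 3556 / 25 = 142.24.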