GitHub user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/862#issuecomment-44191684
  
    Hi @ueshin, could you please give some pointers on the semantics of the 
`avg` function you described? I checked the official documentation and code 
base of Hive, and also ran several experiments with Hive 0.12.0, but didn't 
find any indication that `avg` should count null values.
    
    For a quick proof, here is a sample session I ran under Hive 0.12.0:
    
    ```
    hive> create table src(key int, value string);             
    hive> load data local inpath '/tmp/kv3.txt' into table src;
    hive> select avg(key) from src;                  
    ...
    OK
    237.06666666666666
    hive> select avg(key) from src where key is not null;
    ...
    OK
    237.06666666666666
    ```
    
    `kv3.txt` is copied from the Hive test data files and contains 15 non-null 
keys and 10 null keys:
    
    ```
    hive> select key, value from src;
    hive> select * from src;                                   
    ...
    OK
    238     val_238
    NULL
    311     val_311
    NULL    val_27
    NULL    val_165
    NULL    val_409
    255     val_255
    278     val_278
    98      val_98
    NULL    val_484
    NULL    val_265
    NULL    val_193
    401     val_401
    150     val_150
    273     val_273
    224
    369
    66      val_66
    128
    213     val_213
    146     val_146
    406     val_406
    NULL
    NULL
    NULL
    ```
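
    As a cross-check of the arithmetic, here is a minimal Python sketch. It 
was not part of the Hive session: the `keys` list is transcribed by hand from 
the listing above, and `None` stands in for `NULL`. It compares the average 
taken over the non-null keys only with the average that would result if the 
null rows were counted in the denominator.

```python
# Keys transcribed by hand from the `select * from src` listing above;
# None stands in for NULL. This is an illustration, not Hive code.
keys = [238, None, 311, None, None, None, 255, 278, 98, None, None, None,
        401, 150, 273, 224, 369, 66, 128, 213, 146, 406, None, None, None]

non_null = [k for k in keys if k is not None]

# Average that skips NULLs: divide by the number of non-null keys.
avg_skipping_nulls = sum(non_null) / len(non_null)

# Hypothetical average that counts NULL rows in the denominator.
avg_counting_nulls = sum(non_null) / len(keys)

print(len(non_null), len(keys) - len(non_null))  # non-null count, null count
print(avg_skipping_nulls)                        # about 237.07
print(avg_counting_nulls)                        # about 142.24
```

    Only the first figure agrees with what Hive returned for both queries, 
which is consistent with `avg` ignoring null keys.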

