[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

Jason Dere (JIRA) Mon, 09 Dec 2013 11:28:39 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843441#comment-13843441
 ]


Jason Dere commented on HIVE-5878:
----------------------------------

Just adding a couple of bits of info to this discussion.  There are some 
significant effects for users if we decide to go with the change to returning 
decimal type (both for avg and for int / int).

1) As I mentioned in HIVE-5356, there is a significant performance hit to 
switching to decimal arithmetic.  I suppose for avg() there could be some ways 
to improve this for int type, such as keeping the sum/count portions as long, 
and only doing decimal math when computing sum / count.

2) Decimal arithmetic can return NULL for some non-null inputs, if Hive is not 
able to represent the entire result exactly.  While most common cases (such as 
the example in HIVE-5022) should not longer occur, in the general case it is 
still possible to get a NULL result if the expression contains enough 
operations.  I know, there has been some discussion on this already, and some 
of the reasoning behind why those semantics were ok for decimal arithmetic was 
that if the user had made the choice to use decimal columns/types, then they 
knew what they were getting into with the decimal arithmetic.  But in this case 
users are going to start seeing the effects of these decimal semantics all over 
the place because they will start getting applied on expressions involving 
integer columns.  I don't think that users will find this NULL behavior 
desirable with their integer expressions, so I think something will need to be 
done one way or another here - either don't return decimal type in these 
situations, or possibly address this null behavior in some way.

> Hive standard avg UDAF returns double as the return type for some exact input 
> types
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-5878
>                 URL: https://issues.apache.org/jira/browse/HIVE-5878
>             Project: Hive
>          Issue Type: Bug
>          Components: Types, UDF
>    Affects Versions: 0.12.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the 
> result type.
> {code}
> hive> desc test;
> OK
> d                     int                     None                
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: avg(VALUE._col0)
>           bucketGroup: false
>           mode: mergepartial
>           outputColumnNames: _col0
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. 
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +-------+--------------+------+-----+---------+-------+
> | Field | Type         | Null | Key | Default | Extra |
> +-------+--------------+------+-----+---------+-------+
> | i     | int(11)      | YES  |     | NULL    |       |
> | b     | tinyint(1)   | YES  |     | NULL    |       |
> | d     | double       | YES  |     | NULL    |       |
> | s     | varchar(5)   | YES  |     | NULL    |       |
> | dd    | decimal(5,2) | YES  |     | NULL    |       |
> +-------+--------------+------+-----+---------+-------+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +-------+---------------+------+-----+---------+-------+
> | Field | Type          | Null | Key | Default | Extra |
> +-------+---------------+------+-----+---------+-------+
> | avg(i) | decimal(14,4) | YES  |     | NULL    |       |
> +-------+---------------+------+-----+---------+-------+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types

Reply via email to