[
https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843441#comment-13843441
]
Jason Dere commented on HIVE-5878:
----------------------------------
Just adding a couple of bits of info to this discussion. There are some
significant effects for users if we decide to go with the change to returning
decimal type (both for avg and for int / int).
1) As I mentioned in HIVE-5356, there is a significant performance hit to
switching to decimal arithmetic. I suppose for avg() there could be some ways
to improve this for int type, such as keeping the sum/count portions as long,
and only doing decimal math when computing sum / count.
2) Decimal arithmetic can return NULL for some non-null inputs, if Hive is not
able to represent the entire result exactly. While most common cases (such as
the example in HIVE-5022) should not longer occur, in the general case it is
still possible to get a NULL result if the expression contains enough
operations. I know, there has been some discussion on this already, and some
of the reasoning behind why those semantics were ok for decimal arithmetic was
that if the user had made the choice to use decimal columns/types, then they
knew what they were getting into with the decimal arithmetic. But in this case
users are going to start seeing the effects of these decimal semantics all over
the place because they will start getting applied on expressions involving
integer columns. I don't think that users will find this NULL behavior
desirable with their integer expressions, so I think something will need to be
done one way or another here - either don't return decimal type in these
situations, or possibly address this null behavior in some way.
> Hive standard avg UDAF returns double as the return type for some exact input
> types
> -----------------------------------------------------------------------------------
>
> Key: HIVE-5878
> URL: https://issues.apache.org/jira/browse/HIVE-5878
> Project: Hive
> Issue Type: Bug
> Components: Types, UDF
> Affects Versions: 0.12.0
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the
> result type.
> {code}
> hive> desc test;
> OK
> d int None
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;
> ...
> Reduce Operator Tree:
> Group By Operator
> aggregations:
> expr: avg(VALUE._col0)
> bucketGroup: false
> mode: mergepartial
> outputColumnNames: _col0
> Select Operator
> expressions:
> expr: _col0
> type: double
> {code}
> However, exact types including integers and decimal should yield exact type.
> Here is what MySQL does:
> {code}
> mysql> desc test;
> +-------+--------------+------+-----+---------+-------+
> | Field | Type | Null | Key | Default | Extra |
> +-------+--------------+------+-----+---------+-------+
> | i | int(11) | YES | | NULL | |
> | b | tinyint(1) | YES | | NULL | |
> | d | double | YES | | NULL | |
> | s | varchar(5) | YES | | NULL | |
> | dd | decimal(5,2) | YES | | NULL | |
> +-------+--------------+------+-----+---------+-------+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +-------+---------------+------+-----+---------+-------+
> | Field | Type | Null | Key | Default | Extra |
> +-------+---------------+------+-----+---------+-------+
> | avg(i) | decimal(14,4) | YES | | NULL | |
> +-------+---------------+------+-----+---------+-------+
> 1 row in set (0.00 sec)
> {code}
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)