[ https://issues.apache.org/jira/browse/HIVE-5996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846705#comment-13846705 ]

Xuefu Zhang commented on HIVE-5996:
-----------------------------------

[~ehans] Thanks for sharing your thoughts and your inquiry. For your 
information, I'm not trying to make MySQL the model. My first line of 
consideration is the SQL standard. Where functionality has no SQL standard, 
Hive doesn't have to invent everything itself, so I do reference MySQL for 
ideas, mostly because MySQL and its technical documentation are readily 
available. However, this doesn't preclude me from following other DBs' 
practices. For instance, precision/scale determination for arithmetic 
operations in Hive follows SQL Server's formula. I'm neither anti- nor 
pro-MySQL, and the same goes for SQL Server, but I strongly believe that 
following well-established practices benefits Hive more than doing something 
in a unique, unfortunate way. An example is int/int in Hive, illustrated below.
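
To make the contrast concrete (outputs are illustrative and depend on the 
version; the SQL Server session is shown only for comparison):

{code}
-- Hive: / on two int operands performs floating-point division, returns double
hive> select 3/2;
OK
1.5

-- SQL Server: int/int is integer division
1> select 3/2;
2> go
1
{code}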

However, a lot of existing functionality in Hive was put into place when Hive 
was positioned as a tool rather than a DB, and before all the necessary data 
types were introduced. Take int/int again as an example: the early developers 
probably didn't even think about SQL compliance, and even if they did, there 
was no decimal data type to consider. As Hive shifts toward being positioned 
as a DB on big data, I believe we should start thinking from a perspective 
other than performance or backward compatibility. If we restrict ourselves 
based on unconscious decisions made in the past, we may lose a lot of 
opportunities to do the right thing.

As I worked on decimal precision/scale support, I found a lot of problems in 
Hive around data types and their conversions and promotions. In many cases, 
Hive is not even consistent with itself. Let me ask you a question to see if 
you know the answer: what's the return type of 35 + '3.14', where 35 is from 
an int column and '3.14' is from a string column? Before I made the changes, 
you probably would have said: wait, let me read the code first. And your 
answer might have been different if my question were 35/'3.14'. Now, to answer 
the same questions, I can give the answer right away, and I have a theory to 
explain why. In summary, it has been a lot of effort to clean up the mess and 
inconsistency in Hive since the beginning of my work on decimal. The sketch 
below shows the kind of uniform rule I mean.
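
For instance (outputs illustrative; this assumes a uniform rule that a string 
operand in an arithmetic expression is promoted to double, so the two cases no 
longer diverge):

{code}
hive> select 35 + '3.14';   -- string promoted to double; result type is double
OK
38.14
hive> select 35 / '3.14';   -- same promotion rule; no special case for division
OK
11.146496815286623
{code}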

Now, if we use either performance or backward compatibility to shut down what 
we have achieved, I don't see how Hive is shifting from a tool to a DB, or how 
Hive can come to be adopted as an enterprise-grade product.

Hive is still evolving, and that's why I think we have a certain luxury of 
breaking backward compatibility in order to do the right thing. As Ashutosh 
once mentioned, we don't want to be backward compatible with a bug. Once Hive 
is stabilized, it becomes much harder to make backward-incompatible changes, 
as you know from your experience with SQL Server.

I understand your concern about backward compatibility, especially your 
possible frustration over vectorization breaking or needing rework. On the 
other hand, I think we are here to help Hive become more useful. A blunt 
rejection without much consideration and communication doesn't seem as helpful 
and constructive as it could be.

> Query for sum of a long column of a table with only two rows produces wrong 
> result
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-5996
>                 URL: https://issues.apache.org/jira/browse/HIVE-5996
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 0.12.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-5996.patch
>
>
> {code}
> hive> desc test2;
> OK
> l                     bigint                  None                
> hive> select * from test2;                                 
> OK
> 6666666666666666666
> 5555555555555555555
> hive> select sum(l) from test2;
> OK
> -6224521851487329395
> {code}
> It's believed that a wrap-around error occurred. It's surprising that it 
> happens with only two rows. The same query in MySQL returns:
> {code}
> mysql> select sum(l) from test;
> +----------------------+
> | sum(l)               |
> +----------------------+
> | 12222222222222222221 |
> +----------------------+
> 1 row in set (0.00 sec)
> {code}
> Hive should accommodate a large number of rows. Overflowing with only two 
> rows makes the function hardly usable.
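> The reported value is consistent with 64-bit two's-complement wrap-around: 
> 6666666666666666666 + 5555555555555555555 = 12222222222222222221, which 
> exceeds Long.MAX_VALUE (9223372036854775807); subtracting 2^64 
> (18446744073709551616) yields -6224521851487329395, exactly the value above. 
> The wrap can likely be reproduced with plain literals (illustrative; assumes 
> the out-of-int-range literals are parsed as bigint):
> {code}
> hive> select 6666666666666666666 + 5555555555555555555;
> OK
> -6224521851487329395
> {code}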


