[ https://issues.apache.org/jira/browse/HIVE-5996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846705#comment-13846705 ]
Xuefu Zhang commented on HIVE-5996:
-----------------------------------

[~ehans] Thanks for sharing your thoughts and your inquiry. For your information, I'm not trying to make MySQL the model. My first consideration is the SQL standard. Where there is no SQL standard for a piece of functionality, Hive doesn't have to invent everything itself, so I do reference MySQL for ideas, mostly because MySQL and its technical documentation are readily available. However, this doesn't preclude me from following other databases' practice. For instance, precision/scale determination for arithmetic operations in Hive follows SQL Server's formula. I'm neither anti- nor pro-MySQL, and the same goes for SQL Server, but I strongly believe that following well-established practices benefits Hive more than doing something in a unique, unfortunate way. An example would be int/int in Hive.

However, a lot of existing functionality in Hive was put in place when Hive was positioned as a tool rather than a DB, and before all the necessary data types were introduced. Take int/int again as an example: the early developers probably didn't even think about SQL compliance, and even if they did, there was no decimal data type to consider. As Hive shifts toward being positioned as a DB on big data, I believe we should start thinking from a perspective other than just performance or backward compatibility. If we restrict ourselves to decisions made unconsciously in the past, we may lose a lot of opportunities to do the right thing.

As I worked on decimal precision/scale support, I found a lot of problems in Hive around data types and their conversions and promotions. In many cases, Hive is not even consistent with itself. Let me ask you a question to see if you know the answer: what's the return type of 35 + '3.14', where 35 comes from an int column and '3.14' from a string column? Before I made the changes, you probably would have said: wait, let me read the code first. And your answer might have been different if my question were 35 / '3.14'. Now, to answer the same questions, I can give the right answer right away, and I have a theory that explains why. In summary, it has taken a lot of effort, from the beginning of my work on decimal, to clean up the mess and inconsistency in Hive.

Now, if we use either performance or backward compatibility to shut down what we have achieved, I don't see how Hive is shifting from a tool to a DB, or how Hive can become adopted as an enterprise-grade product. Hive is still evolving, and that's why I think we have a certain luxury of breaking backward compatibility in order to do the right thing. As Ashutosh once mentioned, we don't want to be backward compatible with a bug. Once Hive is stabilized, it becomes much harder to make backward-incompatible changes, as you know from your experience with SQL Server.

I understand your concern about backward compatibility, especially your possible frustration over vectorization work breaking or needing to be redone. On the other hand, I think we are here to help Hive become more useful. A blunt rejection without much consideration and communication doesn't seem as helpful and constructive as it could be.
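To make the type-resolution question concrete, here is an illustrative session; the table t and its columns i and s are hypothetical and only show the kind of expression whose result type used to be hard to predict without reading the code:

{code}
hive> desc t;                                         -- hypothetical table, for illustration only
OK
i                       int                     None
s                       string                  None
hive> explain select i + s, i / s, i / i from t;      -- the plan shows the resolved result types
{code}

Before the cleanup, the result types of these three expressions had to be dug out of the plan (or the source); with consistent promotion rules they follow from a single theory.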
> Query for sum of a long column of a table with only two rows produces wrong
> result
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-5996
>                 URL: https://issues.apache.org/jira/browse/HIVE-5996
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 0.12.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-5996.patch
>
> {code}
> hive> desc test2;
> OK
> l                       bigint                  None
> hive> select * from test2;
> OK
> 6666666666666666666
> 5555555555555555555
> hive> select sum(l) from test2;
> OK
> -6224521851487329395
> {code}
> It's believed that a wrap-around error occurred. It's surprising that it
> happens only with two rows. Same query in MySQL returns:
> {code}
> mysql> select sum(l) from test;
> +----------------------+
> | sum(l)               |
> +----------------------+
> | 12222222222222222221 |
> +----------------------+
> 1 row in set (0.00 sec)
> {code}
> Hive should accommodate large number of rows. Overflowing with only two rows
> is very unusable.
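For reference, the reported result is exactly what a 64-bit two's-complement wrap-around of the true sum would produce, which supports the wrap-around theory. Worked out below (plain arithmetic, not Hive output):

{code}
 6666666666666666666 + 5555555555555555555 = 12222222222222222221    (true sum)
 12222222222222222221 > 9223372036854775807                          (exceeds the bigint maximum)
 12222222222222222221 - 2^64 = 12222222222222222221 - 18446744073709551616
                             = -6224521851487329395                  (the value Hive returned)
{code}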