cocopc opened a new issue #1736: URL: https://github.com/apache/hudi/issues/1736
Env:
  Hive: 2.1.1
  Hudi: 0.5.2
  Spark: 2.4.5

I have a MOR table written with the upsert operation. Querying it with spark-sql returns the correct result, but querying it with Hive on MR returns the wrong result.

My table info:
  Table name: user
  Record key: distinct_id

SQL:
  select distinct_id, count(1) from user group by distinct_id order by distinct_id desc limit 10

Query with Spark: the result is correct.

+-----------+--------+
|distinct_id|count(1)|
+-----------+--------+
|   51819928|       1|
|   51819908|       1|
|   51819791|       1|
|   51819580|       1|
|   51819136|       1|
|   51819001|       1|
|   51818734|       1|
|   51818645|       1|
|   51818417|       1|
|   51818329|       1|
+-----------+--------+

Query with Hive: the result is wrong. The count should be 1 for each distinct_id, because distinct_id is the record key and upserts should be merged.

+--------------+-----+--+
| distinct_id  | c1  |
+--------------+-----+--+
| 51819928     | 8   |
| 51819908     | 22  |
| 51819791     | 7   |
| 51819580     | 11  |
| 51819136     | 9   |
| 51819001     | 24  |
| 51818734     | 9   |
| 51818645     | 23  |
| 51818417     | 22  |
| 51818329     | 26  |

Query with Hive:
  select * from user where distinct_id='51819928';

This query returns only one row, which is correct. So strange!
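
To make the expectation in this report concrete, here is a minimal toy model of upsert semantics in Python. It is not Hudi's implementation and does not establish the cause of the Hive result. The `upsert` helper, the `v` field, and the batch contents are hypothetical and chosen only for illustration. It shows that counting rows per record key over a merged view gives 1 per key, while counting every written version without merging gives counts above 1.

```python
from collections import Counter


def upsert(table, records, key="distinct_id"):
    """Toy upsert: insert new keys, overwrite the row of an existing key."""
    for record in records:
        table[record[key]] = record
    return table


# Hypothetical write batches (illustrative values, not data from the issue).
batches = [
    [{"distinct_id": "51819928", "v": 1}, {"distinct_id": "51819908", "v": 1}],
    [{"distinct_id": "51819928", "v": 2}],
    [{"distinct_id": "51819928", "v": 3}, {"distinct_id": "51819908", "v": 2}],
]

table = {}
for batch in batches:
    upsert(table, batch)

# Merged view: exactly one row per record key survives.
merged_counts = Counter(row["distinct_id"] for row in table.values())

# Unmerged view: every written version is counted separately.
unmerged_counts = Counter(r["distinct_id"] for b in batches for r in b)

print("merged:  ", dict(merged_counts))
print("unmerged:", dict(unmerged_counts))
```

In this model, the group-by count the report expects corresponds to `merged_counts`. Whether the Hive query is actually counting unmerged versions is an assumption, not something the report confirms.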
