Pavan Gadam Manohar created HIVE-6099: -----------------------------------------
Summary: Multi insert does not work properly with distinct count Key: HIVE-6099 URL: https://issues.apache.org/jira/browse/HIVE-6099 Project: Hive Issue Type: Bug Affects Versions: 0.10.0, 0.9.0 Reporter: Pavan Gadam Manohar Need 2 rows to reproduce this Bug. Here are the steps. Step 1) Create a table Table_A CREATE EXTERNAL TABLE Table_A ( user string , type int ) PARTITIONED BY (dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS RCFILE LOCATION '/hive/<path>/Table_A'; Step 2) Scenario: Lets us say consider user tommy belong to both usertypes 111 and 123. Insert 2 records into the table created above. select * from Table_A; hive> select * from table_a; OK tommy 123 2013-12-02 tommy 111 2013-12-02 Step 3) Create 2 destination tables to simulate multi-insert. CREATE EXTERNAL TABLE dest_Table_A ( p_date string , Distinct_Users int , Type111Users int , Type123Users int ) PARTITIONED BY (dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS RCFILE LOCATION '/hive/<path>/dest_Table_A'; CREATE EXTERNAL TABLE dest_Table_B ( p_date string , Distinct_Users int , Type111Users int , Type123Users int ) PARTITIONED BY (dt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS RCFILE LOCATION '/hive/<path>/dest_Table_B'; Step 4) Multi insert statement from Table_A a INSERT OVERWRITE TABLE dest_Table_A PARTITION(dt='2013-12-02') select a.dt ,count(distinct a.user) as AllDist ,count(distinct case when a.type = 111 then a.user else null end) as Type111User ,count(distinct case when a.type != 111 then a.user else null end) as Type123User group by a.dt INSERT OVERWRITE TABLE dest_Table_B PARTITION(dt='2013-12-02') select a.dt ,count(distinct a.user) as AllDist ,count(distinct case when a.type = 111 then a.user else null end) as Type111User ,count(distinct case when a.type != 111 then a.user else null end) as Type123User group by a.dt ; Step 5) Verify results. hive> select * from dest_table_a; OK 2013-12-02 2 1 1 2013-12-02 Time taken: 0.116 seconds hive> select * from dest_table_b; OK 2013-12-02 2 1 1 2013-12-02 Time taken: 0.13 seconds Conclusion: Hive gives a count of 2 for distinct users although there is only one distinct user. After trying many datasets observed that Hive is doing Type111Users + Typoe123Users = DistinctUsers which is wrong. hive> select count(distinct a.user) from table_a a; Gives: Total MapReduce CPU Time Spent: 4 seconds 350 msec OK 1 -- This message was sent by Atlassian JIRA (v6.1.5#6160)