Hadoop: 0.20.1+152 (Cloudera)
Hive: 0.4.2

*Query (simplified for discussion):*

FROM (SELECT c1, c2, day FROM tbl1
      WHERE day >= '2010-02-01' AND day <= '2010-02-20') base_qry
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.* WHERE day = '2010-02-01'
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.* WHERE day = '2010-02-02'
....
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.* WHERE day = '2010-02-20'
*Goal:*

Run a Hive query on day N (execution day 2010-02-21) over table "tbl1" partitions for days N-20 to N-1 (i.e., 2010-02-01 through 2010-02-20) and refresh/overwrite the corresponding "day" partitions of "tbl2".

*Some facts:*

- Each "day" partition of "tbl1" has 24 files.
- The last Hadoop job of base_qry is a map-only job (and we want it to stay map-only for efficiency reasons).

*Issue/problem:*

The query execution produced 480 files (24 * 20) for EACH "day" partition of "tbl2": 24 non-empty files and 456 empty ones. A user running a query like "SELECT blah FROM tbl2 WHERE day >= '2010-02-01' AND day <= '2010-02-20'" now suddenly starts 480 * 20 = 9600 mappers (480 expected), and it gets worse with a bigger date range.

*Work-around:*

As a quick fix I am running a FOR loop over each day's partition (alternatively, I could delete the empty files as a post-processing step). Any other suggestions?

*Solutions:*

- I believe Hive 0.5 handles small/empty files better, so this might no longer be a big issue?
- If we can fix this behavior, I will be happy to spend some time on it. Any inputs/guidance?

Thanks,
Abhi
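P.S. For concreteness, the per-day FOR-loop workaround mentioned above could be sketched roughly as follows (table names and the date range are taken from the post; the `echo` stands in for the actual `hive -e` invocation):

```shell
# Sketch of the per-day workaround: run one single-partition query per day,
# so each map-only job writes into only one "day" partition of tbl2 and
# no empty files land in the other 19 partitions.
for day in $(seq -w 1 20); do          # seq -w zero-pads: 01, 02, ..., 20
  q="FROM (SELECT c1, c2, day FROM tbl1 WHERE day = '2010-02-${day}') base_qry
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.*"
  echo "$q"                            # in practice: hive -e "$q"
done
```

Each iteration produces 24 files (one per input file) instead of 480, at the cost of launching 20 separate Hive jobs.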

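On the Hive 0.5 point: if the small-file merge support is what helps here, the relevant knobs would presumably be the hive.merge.* settings (setting names from Hive's merge feature; whether they cover multi-insert map-only jobs in 0.5 would need checking):

```sql
SET hive.merge.mapfiles=true;           -- merge small files produced by map-only jobs
SET hive.merge.size.per.task=256000000; -- target size (bytes) for merged output files
```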