Hadoop: 0.20.1+152 (Cloudera)
Hive: 0.4.2

*Query (simplified for discussion):*

FROM (SELECT c1, c2, day FROM tbl1
      WHERE day >= '2010-02-01' AND day <= '2010-02-20') base_qry
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.* WHERE day = '2010-02-01'
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.* WHERE day = '2010-02-02'
....
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.* WHERE day = '2010-02-20'
*Goal:*

Run a Hive query on day N (execution day 2010-02-21) over table "tbl1" partitions for days N-20 to N-1 (i.e., 2010-02-01 through 2010-02-20) and refresh/overwrite the corresponding "day" partitions of "tbl2".

*Some facts:*

- Each "day" partition of "tbl1" has 24 files.
- The last Hadoop job of base_qry is a map-only job (and we want it to stay map-only for efficiency reasons).

*Issue/problem:*

The query execution produced 480 files (24 * 20) for EACH "day" partition of "tbl2": 24 non-empty files and 456 empty ones. A user running a query like "SELECT blah FROM tbl2 WHERE day >= '2010-02-01' AND day <= '2010-02-20'" now suddenly starts 480 * 20 = 9600 mappers (480 expected), and it gets worse with a bigger date range.

*Work-around:*

As a quick fix I am running a FOR loop over each day's partition (alternatively, I could delete the empty files as a post-processing step). Any other suggestions?

*Solutions:*

- I believe Hive 0.5 handles small/empty files better, so this might no longer be a big issue?
- If we can fix this behavior, I will be happy to spend some time on it. Any inputs/guidance?

Thanks,
Abhi
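P.S. For concreteness, the per-day FOR-loop workaround mentioned above could be sketched roughly as follows (table names and the date range are taken from the post; the `echo` stands in for the actual `hive -e` invocation):

```shell
# Sketch of the per-day workaround: run one single-partition query per day,
# so each map-only job writes into only one "day" partition of tbl2 and
# no empty files land in the other 19 partitions.
for day in $(seq -w 1 20); do          # seq -w zero-pads: 01, 02, ..., 20
  q="FROM (SELECT c1, c2, day FROM tbl1 WHERE day = '2010-02-${day}') base_qry
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.*"
  echo "$q"                            # in practice: hive -e "$q"
done
```

Each iteration produces 24 files (one per input file) instead of 480, at the cost of launching 20 separate Hive jobs.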

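On the Hive 0.5 point: if the small-file merge support is what helps here, the relevant knobs would presumably be the hive.merge.* settings (setting names from Hive's merge feature; whether they cover multi-insert map-only jobs in 0.5 would need checking):

```sql
SET hive.merge.mapfiles=true;           -- merge small files produced by map-only jobs
SET hive.merge.size.per.task=256000000; -- target size (bytes) for merged output files
```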