Re: Combined data more throughput

Edward Capriolo Tue, 07 Jul 2009 08:15:06 -0700

I want to give a very theoretical, non-technical hypothesis as to what
is happening here. I updated my cluster to use the trunk version and
confirmed that the map-side merging is working. I used diff with the
map-side merge set to true and to false and saw the conditional task
turning on and off.


In my case the map-side merging is "too little, too late". When I took
Ashish's approach and used a REDUCE script, the reduce task was not
progressing and seemed to time out. Now, with map-side merging, the
conditional merge task is timing out for the same reason: the mapper
or reducer is dealing with the output of 4000 maps, and the overhead is
timing the process out. I tried setting "hive.merge.size.per.mapper" to
100,000,000 and to 10,000,000, but that did not seem to help.

I think the map-side merging is probably great for keeping the 'small
files problem' from happening, but it cannot 'fix' the problem once it
has happened. Some point in the process still gets hit with lots of
inputs.

I am going to go to the source of the issue and fix the data ingestion
process. Right now, I drop one file per server every five minutes into
a Hive partition. I can use the map phase to merge these files before
they go into the warehouse. I am also thinking of introducing a second
partition based on hour. Each partition might not be very big
(600MB-1GB?), but the extra partitioning will make it easier to
operate on the data.
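
As a rough illustration of the merge-before-load idea, here is a
minimal Python sketch, not the actual ingestion code. It assumes a
hypothetical file naming scheme of <server>_<YYYYMMDDHHMM>.log and
newline-terminated text files; the function names are made up for the
example. It buckets the five-minute files by day and hour, then
concatenates each bucket into one larger file.

```python
from collections import defaultdict
from datetime import datetime
from pathlib import Path
import shutil


def hour_key(filename):
    """Map a name like 'web01_200907070815.log' to '2009-07-07/08'.

    The '<server>_<YYYYMMDDHHMM>.log' pattern is an assumption made
    for this sketch, not something taken from the real pipeline.
    """
    stamp = Path(filename).stem.rsplit("_", 1)[1]
    ts = datetime.strptime(stamp, "%Y%m%d%H%M")
    return ts.strftime("%Y-%m-%d/%H")


def group_by_hour(filenames):
    """Group the small per-server files by the hour they belong to."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[hour_key(name)].append(name)
    return dict(groups)


def merge_files(paths, out_path):
    """Concatenate the files in 'paths' into a single file.

    Plain concatenation is only safe for line-oriented text where
    every input ends with a newline (assumed here).
    """
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                shutil.copyfileobj(src, out)
```

Each key returned by group_by_hour() would correspond to one day/hour
partition, so the warehouse receives one merged file per hour instead
of one file per server per five minutes.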
