The reason we need a lot of mappers is that:

1) The input data size is very large.
2) The number of input files is very large.

In order to decrease the number of mappers, set mapred.min.split.size to a big number (like 1000000000, i.e. about 1GB). The default value is 128MB. If you increase this, the number of mappers will automatically decrease.

Thanks,
-namit

-----Original Message-----
From: Edward Capriolo [mailto:[email protected]]
Sent: Tuesday, July 07, 2009 8:15 AM
To: [email protected]
Subject: Re: Combined data more throughput

I want to give a very theoretical, non-technical hypothesis as to what is happening here. I updated my cluster to use the trunk version, and I confirmed that the map-side merging is working: I diffed the plans with map-side merge set to true and false and saw the conditional task turning on and off.

In my case the map-side merging is "too little, too late". When I took Ashish's approach and used a REDUCE script, the reduce task was not progressing and seemed to time out. Now, with map-side merging, the conditional merge task is timing out for the same reason: the mapper or reducer is dealing with the output of 4000 maps, and the overhead is timing the process out. I tried tuning hive.merge.size.per.mapper to 100,000,000 and then 10,000,000, but that did not seem to help.

I think map-side merging is probably great for keeping the 'small files problem' from happening, but it cannot 'fix' the problem once it has happened; some point in the process still gets hit with lots of inputs.

I am going to go to the source of the issue and fix the data ingestion process. Right now I drop a file per server per five minutes into a Hive partition. I can use the map phase to merge these files before they go into the warehouse. I am also thinking of introducing a second partition based on hour. Each partition might not be very big (600MB-1GB?), but the extra partitioning will make it easier to operate on the data.
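To make Namit's suggestion concrete, a minimal sketch of applying the split-size override from a Hive CLI session might look like the following; the value is the one suggested above, and the table name and query are hypothetical:

    -- Raise the minimum split size so each mapper reads roughly 1GB of
    -- input, which reduces the number of mappers launched for the job.
    SET mapred.min.split.size=1000000000;

    -- Hypothetical query; fewer, larger splits mean fewer map tasks.
    SELECT COUNT(*) FROM raw_logs;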
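The merge tuning Edward describes would look roughly like this in a session; hive.merge.size.per.mapper is the property named in the thread (trunk at the time), and the values are the ones he reports trying:

    -- Enable merging of the small files produced by map-only jobs,
    -- and cap the amount of data each merge mapper should target.
    SET hive.merge.mapfiles=true;
    SET hive.merge.size.per.mapper=100000000;  -- also tried 10000000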
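A sketch of the ingestion fix described in the last paragraph, assuming hypothetical raw_logs and logs_hourly tables: add an hour-level partition key, then run a map-only INSERT that compacts the many per-server five-minute files into the merged files of a single (dt, hr) partition:

    -- Hypothetical staging table holding the raw five-minute files.
    CREATE TABLE raw_logs (line STRING, hr STRING)
    PARTITIONED BY (dt STRING);

    -- Target table with the extra hour-level partition key.
    CREATE TABLE logs_hourly (line STRING)
    PARTITIONED BY (dt STRING, hr STRING);

    -- Map-only rewrite: the map phase merges the small files as the
    -- hour's data is written into one compacted partition.
    INSERT OVERWRITE TABLE logs_hourly PARTITION (dt='2009-07-07', hr='08')
    SELECT line FROM raw_logs WHERE dt='2009-07-07' AND hr='08';

Because the SELECT has no aggregation, Hive runs it as a map-only job, which matches the idea of using the map phase to merge the files before they enter the warehouse.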
