The reason we need a lot of mappers is that:

1) The input data size is very large.
2) The number of input files is very large.

In order to decrease the number of mappers, set mapred.min.split.size to a big number (like 1000000000, i.e. about 1GB). The default value is 128MB. If you increase this, the number of mappers will automatically decrease.

Thanks,
-namit

-----Original Message-----
From: Edward Capriolo [mailto:[email protected]]
Sent: Tuesday, July 07, 2009 8:15 AM
To: [email protected]
Subject: Re: Combined data more throughput

I want to give a very theoretical, non-technical hypothesis as to what is happening here. I updated my cluster to use the trunk version, and I confirmed that the map-side merging is working: I diffed the plans with map-side merge set to true and false and saw the conditional task turning on and off.

In my case the map-side merging is "too little, too late". When I took Ashish's approach and used a REDUCE script, the reduce task was not progressing and seemed to time out. Now, with map-side merging, the conditional merge task is timing out for the same reason: the mapper or reducer is dealing with the output of 4000 maps, and the overhead is timing the process out. I tried tuning hive.merge.size.per.mapper to 100,000,000 and then 10,000,000, but that did not seem to help.

I think map-side merging is probably great for keeping the 'small files problem' from happening, but it cannot 'fix' the problem once it has happened; some point in the process still gets hit with lots of inputs.

I am going to go to the source of the issue and fix the data ingestion process. Right now I drop a file per server per five minutes into a Hive partition. I can use the map phase to merge these files before they go into the warehouse. I am also thinking of introducing a second partition based on hour. Each partition might not be very big (600MB-1GB?), but the extra partitioning will make it easier to operate on the data.
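To make Namit's suggestion concrete, a minimal sketch of applying the split-size override from a Hive CLI session might look like the following; the value is the one suggested above, and the table name and query are hypothetical:

    -- Raise the minimum split size so each mapper reads roughly 1GB of
    -- input, which reduces the number of mappers launched for the job.
    SET mapred.min.split.size=1000000000;

    -- Hypothetical query; fewer, larger splits mean fewer map tasks.
    SELECT COUNT(*) FROM raw_logs;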
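The merge tuning Edward describes would look roughly like this in a session; hive.merge.size.per.mapper is the property named in the thread (trunk at the time), and the values are the ones he reports trying:

    -- Enable merging of the small files produced by map-only jobs,
    -- and cap the amount of data each merge mapper should target.
    SET hive.merge.mapfiles=true;
    SET hive.merge.size.per.mapper=100000000;  -- also tried 10000000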
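A sketch of the ingestion fix described in the last paragraph, assuming hypothetical raw_logs and logs_hourly tables: add an hour-level partition key, then run a map-only INSERT that compacts the many per-server five-minute files into the merged files of a single (dt, hr) partition:

    -- Hypothetical staging table holding the raw five-minute files.
    CREATE TABLE raw_logs (line STRING, hr STRING)
    PARTITIONED BY (dt STRING);

    -- Target table with the extra hour-level partition key.
    CREATE TABLE logs_hourly (line STRING)
    PARTITIONED BY (dt STRING, hr STRING);

    -- Map-only rewrite: the map phase merges the small files as the
    -- hour's data is written into one compacted partition.
    INSERT OVERWRITE TABLE logs_hourly PARTITION (dt='2009-07-07', hr='08')
    SELECT line FROM raw_logs WHERE dt='2009-07-07' AND hr='08';

Because the SELECT has no aggregation, Hive runs it as a map-only job, which matches the idea of using the map phase to merge the files before they enter the warehouse.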
