I want to offer a very theoretical, non-technical hypothesis as to what is happening here. I updated my cluster to use the trunk version, and I did confirm that the map-side merging is working: I used diff to compare map-side merge set to true versus false and saw the conditional task turning on and off.
In my case the map-side merging is "too little, too late". When I took Ashish's approach and used a REDUCE script, the reduce tasks were not progressing and seemed to time out. Now, with map-side merging, the conditional merge task is timing out for the same reason: the mapper or reducer is dealing with the output of 4000 maps, and the overhead is timing the process out. I tried setting "hive.merge.size.per.mapper" to 100,000,000 and to 10,000,000; that did not seem to help. I think map-side merging is probably great for keeping the 'small files problem' from happening, but it cannot 'fix' the problem once it has happened. Some point in the process still gets hit with lots of inputs.

I am going to go to the source of the issue and fix the data ingestion process. Right now, I drop one file per server per five minutes into a Hive partition. I can use the map phase to merge these files before they go into the warehouse. I am also thinking of introducing a second partition based on hour. Each partition might not be too big (600MB-1GB?), but the extra partitioning will make it easier to operate on the data.
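As a rough back-of-the-envelope check on why an hourly partition should shrink the number of inputs any one task has to deal with, here is a small Python sketch. It assumes one file per server every five minutes, as described above, and that each small file ends up as its own map input. The server count of 14 is purely hypothetical (it is not stated anywhere in this thread) and was picked only because it lands in the neighborhood of the 4000 maps mentioned above.

```python
def files_per_partition(servers: int, interval_minutes: int, partition_hours: int) -> int:
    """Count the files that land in one partition spanning partition_hours.

    Assumes every server drops exactly one file per interval_minutes.
    """
    files_per_server = (partition_hours * 60) // interval_minutes
    return servers * files_per_server


# Hypothetical server count; substitute the real number for your cluster.
SERVERS = 14

daily = files_per_partition(SERVERS, interval_minutes=5, partition_hours=24)
hourly = files_per_partition(SERVERS, interval_minutes=5, partition_hours=1)

print("files in a daily partition: ", daily)
print("files in an hourly partition:", hourly)
```

Under these assumptions a daily partition holds 24 times as many files as an hourly one, so a job over a single hour sees a far smaller fan-in. Merging the files at ingestion time would reduce the count further, independently of how the table is partitioned.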
