On Thu, Apr 16, 2009 at 1:27 AM, tim robertson <[email protected]> wrote:
> What is not 100% clear to me is when to push to S3:
> In the Map I will output the TileId-ZoomLevel-SpeciesId as the key,
> along with the count, and in the Reduce I group the counts into larger
> tiles, and create the PNG. I could write to Sequencefile here... but
> I suspect I could just push to the s3 bucket here also - as long as
> the task tracker does not send the same Keys to multiple reduce tasks
> - my Hadoop naivety showing here (I wrote an in memory threaded
> MapReduceLite which does not compete reducers, but not got into the
> Hadoop code quite so much yet).

Hi Tim,

If I understand what you mean by "compete reducers", then you're referring to the feature called "speculative execution", in which Hadoop schedules multiple TaskTrackers to perform the same task. When one of the multiply-scheduled tasks finishes, the other one is killed. As you seem to already understand, this might cause issues if your tasks have non-idempotent side effects on the outside world.

The configuration variable you need to look at is mapred.reduce.tasks.speculative.execution. If this is set to false, only one reduce task will be run on each key. If it is true, it's possible that some reduce tasks will be scheduled twice to try to reduce variance in job completion times due to slow machines. There's an equivalent configuration variable mapred.map.tasks.speculative.execution that controls this behavior for your map tasks.

Hope that helps,
-Todd
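
P.S. In case it's useful, here is a minimal sketch of how those two properties might be set on a job using the old-style JobConf API. The DensityTileJob class name is just a placeholder for your own driver class:

  import org.apache.hadoop.mapred.JobConf;

  public class DensityTileJob {
    public static JobConf configure() {
      JobConf conf = new JobConf(DensityTileJob.class);

      // Run each reduce task only once, so a reducer that pushes PNG
      // tiles straight to the S3 bucket never gets a duplicate attempt.
      conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

      // Map-side equivalent; usually safe to leave enabled when the
      // mappers have no side effects outside HDFS.
      conf.setBoolean("mapred.map.tasks.speculative.execution", true);

      return conf;
    }
  }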
