You're probably seeing speculative execution. You can turn it off for mappers and reducers by setting mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution to false, respectively.
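
For example, in your job driver with the old mapred API (just a sketch; "MyDbJob" stands in for your own driver class):

    import org.apache.hadoop.mapred.JobConf;

    // Disable speculative execution for both map and reduce tasks.
    JobConf conf = new JobConf(MyDbJob.class);
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

You can also set the same two properties cluster-wide in mapred-site.xml, or per job on the command line with -D if your driver goes through ToolRunner/GenericOptionsParser.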
Another thing to watch out for: a task can fail halfway through, and Hadoop will re-run it from the beginning, so the same db queries can still end up being executed multiple times.

On Thu, Jul 2, 2009 at 2:13 AM, Marcus Herou<[email protected]> wrote:
> Hi.
>
> I've noticed that hadoop spawns parallell copies of the same task on
> different hosts. I've understood that this is due to improve the performance
> of the job by prioritizing fast running tasks. However since we in our jobs
> connect to databases this leads to conflicts when inserting, updating,
> deleting data (duplicated key etc). Yes I know I should consider Hadoop as a
> "Shared Nothing" architecture but I really must connect to databases in the
> jobs. I've created a sharded DB solution which scales as well or I would be
> doomed...
>
> Any hints of how to disable this feature or howto reduce the impact of it ?
>
> Cheers
>
> /Marcus
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [email protected]
> http://www.tailsweep.com/
>
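
One way to reduce the impact of re-run tasks is to make the db writes idempotent, so a replayed task just rewrites the same rows instead of failing on a duplicate key, e.g. with MySQL's INSERT ... ON DUPLICATE KEY UPDATE. Rough sketch only; the clicks table, its columns, and the connection placeholders (jdbcUrl, dbUser, dbPass) are made up:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // Idempotent upsert: re-running the task overwrites the row instead of
    // throwing a duplicate-key error.
    Connection db = DriverManager.getConnection(jdbcUrl, dbUser, dbPass);
    PreparedStatement ps = db.prepareStatement(
        "INSERT INTO clicks (id, cnt) VALUES (?, ?) " +
        "ON DUPLICATE KEY UPDATE cnt = VALUES(cnt)");
    ps.setLong(1, recordId);
    ps.setLong(2, clickCount);
    ps.executeUpdate();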
