You're probably talking about speculative execution. You can turn it
off for map and reduce tasks, respectively, by setting
mapred.map.tasks.speculative.execution and
mapred.reduce.tasks.speculative.execution to false.
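
For example, a rough sketch using the old "mapred" JobConf API (MyJob
is just a placeholder for your driver class):

  import org.apache.hadoop.mapred.JobConf;

  // Disable speculative execution for both map and reduce tasks.
  JobConf conf = new JobConf(MyJob.class);
  conf.setBoolean("mapred.map.tasks.speculative.execution", false);
  conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

You can also set the same two properties cluster-wide in
mapred-site.xml if you never want speculative tasks for any job.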

Another thing to watch out for: a task can fail halfway through, and
Hadoop will then re-run it from the beginning. So even with speculative
execution turned off, it's still possible that the same DB queries are
executed more than once.
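
One way to soften that (just a sketch, not something from this thread,
assuming a MySQL-style shard and a hypothetical table with a unique
key) is to make the writes idempotent, e.g. an upsert, so a re-run task
overwrites its own earlier rows instead of hitting duplicate-key errors:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;

  // Hypothetical connection details and table; the point is only the
  // ON DUPLICATE KEY UPDATE clause, which makes the insert repeatable.
  Connection db = DriverManager.getConnection(jdbcUrl, user, pass);
  PreparedStatement ps = db.prepareStatement(
      "INSERT INTO clicks (id, count) VALUES (?, ?) " +
      "ON DUPLICATE KEY UPDATE count = VALUES(count)");
  ps.setLong(1, recordId);
  ps.setLong(2, clickCount);
  ps.executeUpdate();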



On Thu, Jul 2, 2009 at 2:13 AM, Marcus Herou<[email protected]> wrote:
> Hi.
>
> I've noticed that Hadoop spawns parallel copies of the same task on
> different hosts. I understand this is meant to improve the performance
> of the job by prioritizing fast-running tasks. However, since our jobs
> connect to databases, this leads to conflicts when inserting, updating,
> and deleting data (duplicate keys etc.). Yes, I know I should treat Hadoop
> as a "shared nothing" architecture, but I really must connect to databases
> in the jobs. I've created a sharded DB solution which scales as well, or I
> would be doomed...
>
> Any hints on how to disable this feature or how to reduce its impact?
>
> Cheers
>
> /Marcus
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [email protected]
> http://www.tailsweep.com/
>
