Hi,
Whenever you try to access a DB in parallel you need to design for that.
This means, for instance, ensuring that each of the parallel tasks
inserts into a distinct "partition" or table in the database, to avoid
conflicts and failures. Hadoop does the same in its reduce tasks - each
reduce gets ALL the records that are needed to do its work, so there is a
hash mapping keys to these tasks. Follow the same principle when linking
DB partitions to your Hadoop tasks.
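
Something along these lines could serve as the job's Partitioner - just a
rough sketch, assuming Text keys and that the shard id is derived from the
same hash your DB sharding uses:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route every record for a given key to the reduce task that "owns"
    // the matching DB shard, so no two tasks write to the same shard.
    public class ShardPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numReduceTasks) {
            // Plain hashCode here as an assumption; use whatever function
            // your sharding scheme is built on.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }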

In general, think about how to avoid DB inserts from Hadoop altogether.
Rather, write your results to files.
At the end of the process you can - if this is what you HAVE to do -
bulk load all the records into the database using efficient loading
utilities.
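
For example, the reduce side could just emit TAB-delimited rows (a rough
sketch - the column layout is made up); the default TextOutputFormat
writes them as part-* files that a loader such as MySQL's LOAD DATA
INFILE or PostgreSQL's COPY can ingest afterwards:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Emit rows as plain text instead of issuing INSERTs against the DB.
    public class RowFileReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            for (Text v : values) {
                // TextOutputFormat writes key and value separated by a TAB,
                // one row per line - ready for a bulk loader.
                ctx.write(key, v);
            }
        }
    }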

If you need the DB to communicate among your tasks, meaning that you need
the inserts to be readily available for other tasks to select, then it is
obviously the wrong medium for such sharing, and you need to look at other
solutions to share consistent data among Hadoop tasks - for instance
ZooKeeper.
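
A rough sketch of what that could look like with the ZooKeeper Java API -
the ensemble address and znode path below are made up:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SharedStateSketch {
        public static void main(String[] args) throws Exception {
            // "zk-host:2181" is a placeholder for your ZooKeeper ensemble.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, null);

            // One task publishes a small, consistent piece of state...
            zk.create("/shard-42-loaded", "done".getBytes("UTF-8"),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // ...and any other task can read it back consistently.
            byte[] state = zk.getData("/shard-42-loaded", false, null);
            System.out.println(new String(state, "UTF-8"));

            zk.close();
        }
    }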

Regards,
- Uri



From: Marcus Herou <[email protected]>
To: common-user <[email protected]>
Date: 02/07/2009 12:13 PM
Subject: Parallell maps



Hi.

I've noticed that Hadoop spawns parallel copies of the same task on
different hosts. I've understood that this is done to improve the
performance of the job by prioritizing fast-running tasks. However, since
we connect to databases in our jobs, this leads to conflicts when
inserting, updating and deleting data (duplicate keys etc). Yes, I know I
should consider Hadoop a "Shared Nothing" architecture, but I really must
connect to databases in the jobs. I've created a sharded DB solution which
scales as well, or I would be doomed...

Any hints on how to disable this feature, or how to reduce its impact?

Cheers

/Marcus

-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/

