Hi.

Yep, I reckon so. As I said, I have created a sharded DB solution where all
data is currently spread across 50 shards. I totally agree that one should
try to emit data to the OutputCollector, but when one has the data in memory
I do not want to throw it away and re-read it into memory later on, after
the job is done, via chained Hadoop jobs... In our stats system we do
exactly this: almost no DB anywhere in the complex chain, just reducing and
counting like crazy :)
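To make the "just reducing and counting" concrete, here is a plain-Java sketch of the reduce-side logic. It is not the real Hadoop API, just the idea, and the class and method names are my own:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CountReduceSketch {

    // Mimics one reduce(key, values) call: sum the counts for a key
    // in memory instead of inserting per-record rows into a DB.
    static long reduce(List<Long> counts) {
        long sum = 0;
        for (long c : counts) {
            sum += c;
        }
        return sum;
    }

    // Groups (key, 1) pairs by key and reduces each group, the way a
    // chained counting job would, with no database in the chain.
    static Map<String, Long> countAll(List<String> keys) {
        Map<String, List<Long>> grouped = new LinkedHashMap<>();
        for (String k : keys) {
            grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(1L);
        }
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            out.put(e.getKey(), reduce(e.getValue()));
        }
        return out;
    }
}
```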

There is another advantage to following your example: you risk less during
the actual job, since you do not have to create a DB pool of thousands of
connections. And the jobs do not affect the live production DB = good.

Anyway, what do you mean by "if you HAVE to load it into a DB"? How the heck
are my users going to access our search engine without accessing a populated
index or DB?
I've been thinking about this across many use cases, and perhaps I am
thinking totally wrong, but I tend not to be able to implement a
shared-nothing architecture all over the chain; it is simply not possible.
Sometimes you need a DB, sometimes an index, sometimes a distributed cache,
sometimes a file on a local FS/NFS/GlusterFS (yes, I prefer classical mount
points over HDFS, since HDFS cannot be mounted like a classic FS and because
of its speed, for random-access IO that is).

I will make a mental note to adapt and be more Hadoop-ish :)
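BTW, for anyone finding this thread later: the duplicate task attempts I asked about below are Hadoop's speculative execution, and they can be switched off per job. In the Hadoop versions of this era the relevant properties were, as far as I know (double-check against your version's mapred-default.xml):

```xml
<!-- hadoop-site.xml / JobConf: disable speculative execution so only one
     attempt of each task runs, and DB inserts are not duplicated -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```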

Cheers

//Marcus

On Thu, Jul 2, 2009 at 4:27 PM, Uri Shani <[email protected]> wrote:

> Hi,
> Whenever you try to access a DB in parallel you need to design it for that.
> This means, for instance, ensuring that each of the parallel tasks
> inserts into a distinct "partition" or table in the database, to avoid
> conflicts and failures. Hadoop does the same in its reduce tasks - each
> reduce gets ALL the records that are needed to do its function, so there
> is a hash mapping keys to these tasks. You need to follow the same
> principle when linking DB partitions to your Hadoop tasks.
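The hash mapping Uri mentions is essentially what Hadoop's HashPartitioner does; the same formula can pin each key to one of my 50 shards, so all records for a given key land in exactly one reduce task and one shard. A minimal sketch (the class and method names are mine):

```java
public class ShardPartitioner {

    // Same idea as Hadoop's HashPartitioner.getPartition():
    // mask off the sign bit, then mod by the number of partitions,
    // so a given key always maps to the same shard.
    static int shardFor(String key, int numShards) {
        return (key.hashCode() & Integer.MAX_VALUE) % numShards;
    }
}
```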
>
> In general, think about how to avoid DB inserts from Hadoop. Rather,
> create your results in files.
> At the end of the process you can - if this is what you HAVE to do -
> massively load all the records into the database using efficient loading
> utilities.
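Concretely, the "results in files" step could be as simple as writing tab-separated lines that a bulk loader (e.g. MySQL's LOAD DATA INFILE or mysqlimport) can ingest afterwards. A sketch, with an invented file layout of key and count columns:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class BulkLoadFileWriter {

    // Writes key<TAB>count lines, one per record, ready for a bulk
    // loader, instead of doing row-by-row INSERTs from the job.
    static void write(Path out, Map<String, Long> counts) throws IOException {
        try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(out))) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                w.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```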
>
> If you need the DB to communicate among your tasks, meaning that you need
> the inserts to be readily available for other threads to select, then it
> is obviously the wrong medium for such sharing, and you need to look at
> other solutions for sharing consistent data among Hadoop tasks - for
> instance, ZooKeeper.
>
> Regards,
> - Uri
>
>
>
> From: Marcus Herou <[email protected]>
> To: common-user <[email protected]>
> Date: 02/07/2009 12:13 PM
> Subject: Parallell maps
>
>
>
> Hi.
>
> I've noticed that Hadoop spawns parallel copies of the same task on
> different hosts. I've understood that this is done to improve the
> performance of the job by prioritizing fast-running tasks. However, since
> our jobs connect to databases, this leads to conflicts when inserting,
> updating, or deleting data (duplicate keys etc.). Yes, I know I should
> consider Hadoop a "shared nothing" architecture, but I really must
> connect to databases in the jobs. I've created a sharded DB solution
> which scales well, or I would be doomed...
>
> Any hints on how to disable this feature, or how to reduce its impact?
>
> Cheers
>
> /Marcus
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [email protected]
> http://www.tailsweep.com/
>
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/
