Yes, this is likely to fail. You'll almost certainly hit timeouts in
the heartbeats between the data nodes and the name node, and between
the task trackers and the job tracker. Hadoop also uses pipeline
replication between data nodes (client -> DN1 -> DN2 -> ...), which
will either time out or perform very poorly over Internet links. On
the processing side, Hadoop doesn't understand the difference between
data centers, only racks, so it is likely to make bad decisions about
where to place work and won't keep the amount of data passed over
public connections to a minimum. Then there's the security component
(i.e. there isn't any, really)...
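
To put a rough number on the pipeline replication point, here is a
minimal back-of-the-envelope sketch in Python. The function name, the
link speeds and the 64 MB block size are illustrative assumptions of
mine, not measurements or Hadoop APIs. It only models bandwidth (the
slowest hop bounds the whole chain) and ignores latency, acks and
retries, so real numbers would likely be worse.

```python
# Rough model of an HDFS write pipeline: client -> DN1 -> DN2 -> DN3.
# All numbers below are illustrative assumptions, not measurements.


def pipeline_write_seconds(block_mb, link_mbps):
    """Estimate seconds to stream one block through a replication pipeline.

    Data is forwarded hop by hop, so sustained throughput is bounded by
    the slowest link in the chain. Latency, acks and retries are ignored.
    """
    bottleneck_mbps = min(link_mbps)
    # Convert megabytes to megabits, then divide by megabits per second.
    return block_mb * 8 / bottleneck_mbps


BLOCK_MB = 64  # assumed block size in megabytes

# Hypothetical link speeds in megabits per second, one entry per hop.
same_datacenter = [1000, 1000, 1000]
one_wan_hop = [1000, 50, 1000]  # assume DN1 -> DN2 crosses the Internet

if __name__ == "__main__":
    print("all hops local:", pipeline_write_seconds(BLOCK_MB, same_datacenter), "s")
    print("one WAN hop:   ", pipeline_write_seconds(BLOCK_MB, one_wan_hop), "s")
```

Under these assumed speeds a single slow hop stretches the write of
every block from about half a second to about ten seconds, and that is
before any timeouts kick in.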

There are a lot of reasons not to do this right now.

On Sat, Apr 17, 2010 at 4:29 AM,  <alta...@ceid.upatras.gr> wrote:
> Hello,
>
> I want to investigate the matter of running hadoop MapReduce jobs over the
> Internet. I don't mean in private computers, all of them in different
> places, rather a collection of datacenters, connected to each other over
> the Internet.
>
> Would that fail? If yes, how and why? What issues would arise?
>



-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
