This is likely to fail, yes. You'll almost certainly hit timeouts in the heartbeats between the data nodes and the name node, and between the task trackers and the job tracker. Hadoop also uses pipeline replication between data nodes (client -> DN1 -> DN2 -> ...), which will either time out or perform very poorly over Internet links. On the processing side, Hadoop doesn't understand the difference between data centers, only racks, so it is unlikely to spread work around in a way that keeps the amount of data passed over public connections to a minimum. Then there's the security component (i.e. there isn't any, really)...
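To put rough numbers on the pipeline replication point, here is a back-of-the-envelope sketch in Python. It is a toy model, not Hadoop code: the function name, the 64 MB block size, and the link speeds and round-trip times are all assumptions picked for illustration. It only assumes that data streams through every hop at once, so the slowest hop sets the throughput.

```python
def pipeline_write_seconds(block_mb, link_mbps, link_rtt_ms):
    """Rough time to push one block through a replication pipeline.

    block_mb    -- block size in megabytes (assumed value, not read from Hadoop)
    link_mbps   -- bandwidth of each hop in megabits/s (client->DN1, DN1->DN2, ...)
    link_rtt_ms -- round-trip time of each hop in milliseconds
    """
    if len(link_mbps) != len(link_rtt_ms):
        raise ValueError("need one bandwidth and one RTT per hop")
    # Every hop forwards data as it arrives, so the slowest hop is the bottleneck.
    transfer = (block_mb * 8) / min(link_mbps)
    # Count one round trip per hop for the acknowledgement to come back.
    acks = sum(link_rtt_ms) / 1000.0
    return transfer + acks


# Hypothetical numbers: three gigabit hops inside one data center...
lan = pipeline_write_seconds(64, [1000, 1000, 1000], [0.5, 0.5, 0.5])
# ...versus the same pipeline with the middle hop crossing a 50 Mbit/s, 80 ms link.
wan = pipeline_write_seconds(64, [1000, 50, 1000], [0.5, 80, 0.5])

print(f"LAN pipeline: {lan:.2f} s per block")
print(f"WAN pipeline: {wan:.2f} s per block")
```

Under these made-up numbers, a single slow hop makes the whole pipeline about twenty times slower per block, and that is before any timeouts or retries are considered.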
There are a lot of reasons not to do this right now.

On Sat, Apr 17, 2010 at 4:29 AM, <alta...@ceid.upatras.gr> wrote:
> Hello,
>
> I want to investigate the matter of running hadoop MapReduce jobs over the
> Internet. I don't mean in private computers, all of them in different
> places, rather a collection of datacenters, connected to each other over
> the Internet.
>
> Would that fail? If yes, how and why? What issues would arise?
>

--
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com