I think the biggest issue would be upstream bandwidth and latency. If the thought was to use a SETI-style approach, most users wouldn't have the necessary upstream bandwidth to support the DFS. A few local desktop machines would likely significantly outpace a much larger DSL/cable/etc.-based "cluster."
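For rough scale (the numbers below are illustrative assumptions, not measurements), the aggregate upstream of a large broadband "cluster" is easily dwarfed by a handful of machines on a gigabit LAN:

    # Back-of-envelope sketch; link speeds are assumed, not measured.
    dsl_nodes, dsl_up_mbit = 100, 1.0      # hypothetical 1 Mbit/s upstream each
    lan_nodes, lan_up_mbit = 4, 1000.0     # hypothetical gigabit desktops

    dsl_total_mb = dsl_nodes * dsl_up_mbit / 8   # 12.5 MB/s for the whole WAN cluster
    lan_total_mb = lan_nodes * lan_up_mbit / 8   # 500 MB/s for four local machines
    print(dsl_total_mb, lan_total_mb)

And since HDFS pipeline replication pushes every written block back out over those same slow uplinks, the gap only gets worse in practice.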
Nick

On Sat, Apr 17, 2010 at 12:43 PM, Eric Sammer <esam...@cloudera.com> wrote:
> This is likely to fail, yes. The reason is that you'll almost certainly
> encounter timeouts in the heartbeats between the data nodes and the
> name node, and between the task trackers and the job tracker. Hadoop
> also uses pipeline replication between data nodes (client -> DN1 ->
> DN2 -> ...), which will likewise hit timeouts or very poor
> performance. On the processing side, Hadoop doesn't understand the
> difference between data centers, only racks, and is likely to make bad
> decisions about spreading work around such that a minimal amount of
> data is passed over public connections. Then there's the security
> component (i.e. there isn't any, really)...
>
> There are a lot of reasons not to do this right now.
>
> On Sat, Apr 17, 2010 at 4:29 AM, <alta...@ceid.upatras.gr> wrote:
> > Hello,
> >
> > I want to investigate running Hadoop MapReduce jobs over the
> > Internet. I don't mean on private computers, all of them in
> > different places, but rather a collection of datacenters connected
> > to each other over the Internet.
> >
> > Would that fail? If so, how and why? What issues would arise?
>
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com
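To make the rack-awareness point concrete: Hadoop (in the 0.20-era releases current at the time) resolves a node's location by running a script configured via topology.script.file.name in core-site.xml. The script is handed datanode hostnames/IPs as arguments and must print one rack path per line. The sketch below assumes that setup; the subnets and rack paths are made up:

    #!/usr/bin/env python
    # Minimal sketch of a Hadoop topology script (wired in via
    # topology.script.file.name). Hadoop passes one or more hostnames/IPs
    # as arguments and expects one rack path per line of output.
    # The subnet-to-"datacenter" mapping here is purely hypothetical.
    import sys

    RACKS = {
        "10.1.": "/dc1/rack1",   # hypothetical: first datacenter
        "10.2.": "/dc2/rack1",   # hypothetical: second datacenter
    }

    def resolve(host):
        for prefix, rack in RACKS.items():
            if host.startswith(prefix):
                return rack
        return "/default-rack"   # Hadoop's conventional fallback

    for arg in sys.argv[1:]:
        print(resolve(arg))

Note that even if you encode a datacenter into the returned path, as above, block placement and task scheduling still reason only about the leaf rack, so the scheduler never actually "sees" the datacenter boundary - which is exactly the problem Eric describes.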