It sounds like HDFS probably isn't the right fit for your use case. When new nodes join the cluster, the administrator needs to rebalance the cluster before the new nodes receive a share of the existing data. Without rebalancing, newly written blocks will land on the new nodes, but existing data will not be redistributed to them.
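For reference, rebalancing is a manual admin step. A minimal sketch, assuming a standard Hadoop install (the exact script name and flags vary a bit by version; -threshold is the allowed per-node deviation from the cluster's average utilization, in percent):

    $ bin/start-balancer.sh -threshold 10

    (or, equivalently: bin/hadoop balancer -threshold 10)

It runs until every node is within the threshold, and you can stop it early with bin/stop-balancer.sh. In a cluster where nodes come and go constantly, you'd effectively be running this all the time.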
In the case where a node leaves the cluster for more than about 10 minutes, the master will start re-replicating the blocks that were on that node onto other nodes in the cluster. The point is that, although HDFS can handle nodes dying and new nodes being added, it's not designed for this to happen all the time.

Similarly, HDFS doesn't have any real security. You would have to configure your own firewall to limit access, and I imagine doing so would be really annoying when not all machines are behind the same router. (HDFS does give you a little admission control through its datanode include/exclude lists; see the sketch after the quoted message below.)

So anyway, you may want to consider other file systems (perhaps there is something P2P out there?) for what you're trying to do.

Hope this helps.

Alex

On Wed, May 27, 2009 at 1:11 PM, Lukasz Szybalski <szybal...@gmail.com> wrote:

> Hello,
> I wanted to set up HDFS to be used as a public-like file system where,
> aside from a few core computers that will be running the masters, you
> would have some number of datanodes/computers located throughout the
> internet.
>
> How do I set up the master servers, and then 3-65+ slave servers, where
> each server can come or leave at any time?
> How would I control how slave servers are added? Assuming they would
> give me their IP and available size, in return I would need to
> provide them with...?
> Should the ssh account that is used be created in some special way? No
> shell access? Or some restrictions? (command?)
> Are there any specific differences that should be accounted for in
> this "public" version of a Hadoop cluster?
>
> Let me know.
>
> Thanks,
> Lucas
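P.S. On your question about controlling how slaves are added: HDFS lets you whitelist datanodes via an include file. A minimal sketch using the standard dfs.hosts / dfs.hosts.exclude properties in hdfs-site.xml (the file paths here are made up; point them wherever you keep your config):

    <property>
      <name>dfs.hosts</name>
      <value>/path/to/allowed-datanodes</value>
    </property>
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/path/to/excluded-datanodes</value>
    </property>

Only datanodes whose hostname/IP appears in the dfs.hosts file may register with the namenode, and after editing the files you apply the change with "bin/hadoop dfsadmin -refreshNodes". Note this is identification, not authentication: anyone who can spoof a listed address still gets in, so it doesn't replace the firewall.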