Ok, thanks for your advice. Things are getting clearer now. So, say I want to set up a machine as a DataNode that has two or more disks: do I have to configure and run a separate DataNode daemon for every disk? How else could I use all disks if the ndfs.data.dir property only accepts one path (assuming I don't want to rely on MS Windows' dynamic disks or similar OS-specific features)?
Rgrds, Thomas Delnoij

On 11/22/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Thomas Delnoij wrote:
> > What would your general advice in such a case be: to start with NDFS right
> > away?
>
> Yes, if you are starting out using multiple nodes. If you are using
> only a single computer, then NDFS is probably not required.
>
> > Also, this document
> > http://wiki.apache.org/nutch/NutchDistributedFileSystem says that NDFS
> > by default uses a replication rate of 2 copies. Based on
> > that, if I want to calculate how much space would be needed for fetching and
> > indexing that amount of pages, how would I go about that?
>
> The default replication level is actually 3 now.
>
> > Suppose I setup 3 datanodes, each with 300 GB storage space, and 1 namenode,
> > than that would mean that in practice I would have 900/2 = 450 GB storage.
> >
> > So for 100.000.000 pages, averaging 10 Kb each, I would need up to 2000 GB
> > storage on my datanodes?
>
> Yes, that's correct. You need approximately replicationLevel *
> numberPages * 10kB bytes. MapReduce also generates some large temporary
> files during computations. My advice is to buy lots of big disks.
>
> Doug
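For what it's worth, the arithmetic in the quoted exchange can be sketched as a couple of helper functions. This is only an illustration of the formula Doug gives (replicationLevel * numberPages * 10kB) and of the 900/2 = 450 GB capacity estimate; the function names and the 10 KB average page size are just the figures from this thread, not anything from the Nutch code:

```python
def ndfs_storage_gb(num_pages, avg_page_kb=10, replication=3):
    """Rough raw storage needed in GB, using the formula from the
    thread: replicationLevel * numberPages * avgPageSize.
    Uses 1 GB = 10**6 KB to match the round numbers in the mail."""
    return replication * num_pages * avg_page_kb / 10**6

def effective_capacity_gb(num_datanodes, per_node_gb, replication):
    """Usable capacity of a cluster once replication is accounted for."""
    return num_datanodes * per_node_gb / replication

# 100,000,000 pages at the old default replication of 2 -> 2000 GB,
# matching the estimate in the mail:
print(ndfs_storage_gb(100_000_000, replication=2))   # 2000.0

# With the current default replication of 3:
print(ndfs_storage_gb(100_000_000))                  # 3000.0

# 3 datanodes with 300 GB each, replication 2 -> 450 GB usable:
print(effective_capacity_gb(3, 300, 2))              # 450.0
```

Note that, as Doug points out, this ignores the large temporary files MapReduce produces during computations, so it is a lower bound rather than a sizing recommendation.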
