Thomas Delnoij wrote:
What would your general advice in such a case be: to start with NDFS right away?
Yes, if you are starting out with multiple nodes. If you are using only a single computer, then NDFS is probably not required.
Also, this document http://wiki.apache.org/nutch/NutchDistributedFileSystem says that NDFS by default uses a replication rate of 2 copies. Based on that, if I want to calculate how much space would be needed for fetching and indexing that number of pages, how would I go about that?
The default replication level is actually 3 now.
Suppose I set up 3 datanodes, each with 300 GB storage space, and 1 namenode, then that would mean that in practice I would have 900/2 = 450 GB storage. So for 100,000,000 pages, averaging 10 KB each, I would need up to 2000 GB storage on my datanodes?
Yes, that's correct. You need approximately replicationLevel * numberPages * 10 KB of raw disk. MapReduce also generates some large temporary files during computations. My advice is to buy lots of big disks.
Doug
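
[Editor's note: a minimal Java sketch of the estimate described above. The class and method names are illustrative, not part of Nutch; KB and GB are treated as decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes) so the results match the numbers in the thread.]

    // Rough raw-disk requirement for storing fetched pages in NDFS/DFS:
    // requiredBytes ~= replicationLevel * numberPages * avgPageSizeBytes
    public class DfsStorageEstimate {

        static long estimateBytes(int replicationLevel, long numberPages, long avgPageSizeBytes) {
            return replicationLevel * numberPages * avgPageSizeBytes;
        }

        public static void main(String[] args) {
            long pages = 100000000L;      // 100 million pages
            long avgPageSize = 10 * 1000L; // ~10 KB per page (decimal KB)
            long gb = 1000L * 1000 * 1000; // decimal GB

            // Old default of 2 replicas: 2 * 100,000,000 * 10 KB = 2000 GB
            System.out.println(estimateBytes(2, pages, avgPageSize) / gb + " GB");

            // Current default of 3 replicas: 3 * 100,000,000 * 10 KB = 3000 GB
            System.out.println(estimateBytes(3, pages, avgPageSize) / gb + " GB");
        }
    }

Note that this covers only the stored page content; as mentioned above, MapReduce temporary files and index data need additional headroom on top of this figure.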
