Thomas Delnoij wrote:
What would your general advice in such a case be: to start with NDFS right
away?

Yes, if you are starting out using multiple nodes. If you are using only a single computer, then NDFS is probably not required.

Also, this document
http://wiki.apache.org/nutch/NutchDistributedFileSystem says that NDFS
by default uses a replication rate of 2 copies. Based on
that, if I want to calculate how much space would be needed for fetching and
indexing that amount of pages, how would I go about that?

The default replication level is actually 3 now.
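
For a rough sense of what that change means for the cluster described below, here is a back-of-the-envelope sketch (plain Python; the numbers are just the ones from this thread):

    # Usable capacity of an NDFS cluster is roughly raw capacity divided by
    # the replication level, since every block is stored that many times.
    raw_capacity_gb = 3 * 300      # three datanodes with 300 GB each
    replication_level = 3          # the current default, not 2

    usable_gb = raw_capacity_gb / replication_level
    print("usable capacity: about %d GB" % usable_gb)   # ~300 GB, not 450 GB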

Suppose I set up 3 datanodes, each with 300 GB storage space, and 1 namenode,
then that would mean that in practice I would have 900/2 = 450 GB storage.

So for 100,000,000 pages, averaging 10 kB each, I would need up to 2000 GB
storage on my datanodes?

Roughly, yes. You need approximately replicationLevel * numberPages * 10 kB of raw disk, which with the current default of three copies comes to about 3000 GB rather than 2000 GB. MapReduce also generates some large temporary files during its computations, so my advice is to buy lots of big disks.
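
To make the arithmetic concrete, here is a small sketch (plain Python; the 10 kB average page size and the decimal GB units are just the assumptions used in this thread):

    # Rough raw-storage estimate for the crawl discussed above.
    replication_level = 3          # current NDFS default
    number_pages = 100000000       # 100 million pages
    page_size_kb = 10              # assumed average page size

    raw_kb = replication_level * number_pages * page_size_kb
    raw_gb = raw_kb / 1000000.0    # decimal units, as in the question
    print("raw page storage: about %d GB" % raw_gb)      # about 3000 GB

    # MapReduce temporary files need additional headroom on top of this;
    # how much depends on the job, so plan for a generous margin.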

Doug
