OK, thanks for your advice. Things are getting clearer now.
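
If I apply your formula with the current default replication of 3, the
numbers for my example work out to:

  replicationLevel * numberPages * pageSize
  = 3 * 100,000,000 * 10 KB
  = about 3,000 GB

so I should plan for roughly 3 TB across the datanodes, not the 2,000
GB I calculated with replication 2 (plus headroom for the temporary
files MapReduce generates).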

So, say I want to set up a machine as a DataNode that has two or more
disks. Do I have to configure and run a separate DataNode daemon for
every disk? How else could I use all the disks if the ndfs.data.dir
property only accepts a single path (assuming I don't want to rely on
MS Windows' dynamic disks or similar OS-specific features)?
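
For illustration, here is the kind of entry I picture in my
nutch-site.xml (the path is only a placeholder); if one daemon per
disk is really the way to go, I assume each daemon would need its own
configuration pointing ndfs.data.dir at a different disk:

  <property>
    <name>ndfs.data.dir</name>
    <value>/disk1/ndfs/data</value>
  </property>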

Regards, Thomas Delnoij

On 11/22/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Thomas Delnoij wrote:
> > What would your general advice in such a case be: to start with NDFS
> > right away?
>
> Yes, if you are starting out using multiple nodes.  If you are using
> only a single computer, then NDFS is probably not required.
>
> > Also, this document
> > http://wiki.apache.org/nutch/NutchDistributedFileSystem says that NDFS
> > by default uses a replication rate of 2 copies. Based on that, if I
> > want to calculate how much space would be needed for fetching and
> > indexing that amount of pages, how would I go about that?
>
> The default replication level is actually 3 now.
>
> > Suppose I set up 3 datanodes, each with 300 GB of storage space, and
> > 1 namenode; then that would mean that in practice I would have
> > 900/2 = 450 GB of storage.
> >
> > So for 100,000,000 pages, averaging 10 KB each, I would need up to
> > 2000 GB of storage on my datanodes?
>
> Yes, that's correct.  You need approximately replicationLevel *
> numberPages * 10 KB of storage.  MapReduce also generates some large
> temporary files during computations.  My advice is to buy lots of big
> disks.
>
> Doug
>
