You don't want to use DFS on top of NFS. If you use DFS, keep its data on the local drives, not in NFS. If you want to use NFS for shared data, then simply don't use DFS: specify "local" as the filesystem and don't start datanodes or a namenode.
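Something like this in your conf/hadoop-site.xml, roughly -- the property names are from memory for the 0.4.x line, so check them against hadoop-default.xml, and the jobtracker host/port below is just a placeholder:

  <configuration>
    <!-- Use the local filesystem (i.e. your NFS mount) instead of DFS. -->
    <property>
      <name>fs.default.name</name>
      <value>local</value>
    </property>
    <!-- MapReduce still runs distributed, pointed at your jobtracker. -->
    <property>
      <name>mapred.job.tracker</name>
      <value>master.example.com:9001</value>
    </property>
  </configuration>

With that, the slaves read and write through the NFS mount directly, and there's no datanode or namenode to start.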

I think you'll find DFS will perform better than NFS for crawling, indexing, etc. If you like, at the end, you could copy the final index from DFS onto your NFS server, if that's where you'd prefer to have it.
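For example, something along these lines (the paths are just placeholders, and you should double-check the exact copy option in the dfs shell's usage output for your release):

  bin/hadoop dfs -get crawl/indexes /mnt/nfs/crawl/indexes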

Does that help?

Doug

Adam Taylor wrote:
Hello, I've started to do some initial test runs with Hadoop 0.4.0, Nutch
0.8 and Nutchwax 0.6+.  My setup includes several rack-mount servers that
will be used for distributed indexing and a clustered file server that is
NFS-mounted on each server.  I would like all of the Hadoop slaves to
write the index to the file server (instead of to local disk).

I am curious: if the Hadoop master and its slaves will all be accessing the
same file server to store the index, is it possible to run the indexing in
distributed mode but specify "local" for the file system?  I have tried it
this way and couldn't get it to work.  All of the documentation for Hadoop
seems to suggest using distributed mode for both the file system and the
indexing.  However, if I use the distributed file system with my setup,
each slave writes to the same file server and we get a conflict: "Cannot
start multiple Datanode instances sharing the same data directory"

Thanks!
Adam
