I have a medium-sized data set in the terabytes range that currently lives on the NFS file server of a medium-sized institution. Every few months we want to run a chain of five Hadoop jobs on this data. The cluster is medium-sized as well: 40 nodes, about 200 simultaneous jobs. The book says to copy the data to HDFS and then run the job. If I treat the copy to HDFS plus the first mapper as a single task, I wonder whether it wouldn't be just as easy to have a custom reader pull the data straight from the NFS file system as a local file and skip the copy to HDFS altogether. While the read into the mapper may be slower, dropping the copy to HDFS could well make up the difference. Assume that after the job runs the data will be deleted from HDFS; the NFS system is the primary source and that cannot change. Also, the job is not I/O-bound, since there is significant computation at each step.
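For concreteness, here is a minimal sketch of the no-copy variant I have in mind, assuming the NFS export is mounted at the same path on every worker node (the mount point /mnt/nfs/data below is hypothetical). Rather than writing a fully custom reader, I believe Hadoop's LocalFileSystem can simply be pointed at the mount with a file:// input path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NfsInputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "step-1-of-5");
        job.setJarByClass(NfsInputJob.class);
        // job.setMapperClass(...); job.setReducerClass(...); etc.

        // Read directly from the NFS mount instead of copying to HDFS first
        // (path is a placeholder; the mount must exist on every node).
        FileInputFormat.addInputPath(job, new Path("file:///mnt/nfs/data"));

        // Intermediate output can still go to the default FS (HDFS) and be
        // deleted once the chain finishes.
        FileOutputFormat.setOutputPath(job, new Path("/tmp/step1-out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

As far as I can tell, the trade-off is that every input split is then read over NFS at job time rather than from local HDFS blocks, so data locality is lost; my assumption is that the compute-heavy workload hides that cost.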
My questions are: 1) Are my assumptions correct, i.e. could skipping the copy to HDFS save time? 2) Would 200 simultaneous Hadoop jobs overwhelm a medium-sized NFS system?