I have a medium-sized data set in the terabytes range that currently lives on the NFS file server of a medium-sized institution. Every few months we want to run a chain of five Hadoop jobs on this data. The cluster is medium-sized as well: 40 nodes, about 200 simultaneous jobs. The book says to copy the data to HDFS and then run the job. If I treat the copy to HDFS plus the first mapper as a single task, I wonder whether it wouldn't be just as easy to have a custom reader pull the data straight from the NFS file system as a local file and skip the copy to HDFS altogether. While the read into the mapper may be slower, dropping the copy to HDFS could well make up the difference. Assume that after the job runs the data will be deleted from HDFS; the NFS system is the primary source and that cannot change. Also, the job is not I/O-bound, since there is significant computation at each step.
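For concreteness, here is a minimal sketch of the no-copy variant I have in mind, assuming the NFS export is mounted at the same path on every worker node (the mount point /mnt/nfs/data below is hypothetical). Rather than writing a fully custom reader, I believe Hadoop's LocalFileSystem can simply be pointed at the mount with a file:// input path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NfsInputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "step-1-of-5");
        job.setJarByClass(NfsInputJob.class);
        // job.setMapperClass(...); job.setReducerClass(...); etc.

        // Read directly from the NFS mount instead of copying to HDFS first
        // (path is a placeholder; the mount must exist on every node).
        FileInputFormat.addInputPath(job, new Path("file:///mnt/nfs/data"));

        // Intermediate output can still go to the default FS (HDFS) and be
        // deleted once the chain finishes.
        FileOutputFormat.setOutputPath(job, new Path("/tmp/step1-out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

As far as I can tell, the trade-off is that every input split is then read over NFS at job time rather than from local HDFS blocks, so data locality is lost; my assumption is that the compute-heavy workload hides that cost.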
My questions are: 1) Are my assumptions correct, i.e. could skipping the copy to HDFS save time? 2) Would 200 simultaneous Hadoop jobs overwhelm a medium-sized NFS system?