I understand, although I can write an InputFormat / splitter which starts with a file on NFS rather than HDFS. Also, if I count the import to HDFS as part of the processing, haven't I already gone through a single machine to read all of the data?
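To make the idea concrete, here is a minimal driver sketch - the class names and paths are placeholders, and it assumes the NFS export is mounted at the same path on every node, in which case the stock TextInputFormat can split and read the files through a file:// URI and no custom InputFormat is needed at all:

// Minimal driver sketch (assumption: the NFS export is mounted at the same
// path, /mnt/nfs/data, on every node; class names and paths are placeholders).
// Pointing the job at a file:// URI makes Hadoop use LocalFileSystem, so each
// mapper reads its split straight off the NFS mount with no copy into HDFS.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NfsInputJob {

    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The real per-record computation would go here; this just
            // forwards each input line unchanged.
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "nfs-input-sketch");
        job.setJarByClass(NfsInputJob.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // file:// = LocalFileSystem, i.e. read directly from the NFS mount.
        FileInputFormat.addInputPath(job, new Path("file:///mnt/nfs/data/input"));
        // A plain path resolves against the default filesystem (HDFS on a cluster).
        FileOutputFormat.setOutputPath(job, new Path("/tmp/nfs-job-output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Whether the mappers can keep the NFS server busy without swamping it is exactly the open question below.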
On Mon, May 19, 2014 at 2:30 PM, David Rosenstrauch <dar...@darose.net> wrote:

> The reason why you want to copy to HDFS first is that HDFS splits the data
> and distributes it across the nodes in the cluster. So if your input data
> is large, you'll get much better efficiency/speed in processing it if
> you're processing it in a distributed manner. (I.e., multiple machines
> each processing a piece of it - multiple mappers.) I'd think that keeping
> the data in NFS would be quite slow.
>
> HTH,
>
> DR
>
> On 05/15/2014 04:45 PM, Steve Lewis wrote:
>
>> I have a medium-sized data set in the terabytes range that currently
>> lives on the NFS file server of a medium institution. Every few months we
>> want to run a chain of five Hadoop jobs on this data.
>> The cluster is medium sized - 40 nodes, about 200 simultaneous jobs. The
>> book says copy the data to HDFS and run the job. If I consider the copy
>> to HDFS and the first mapper as a single task, I wonder if it is not just
>> as easy to have a custom reader read from the NFS file system as a local
>> file and skip the step of copying to Hadoop.
>> While the read to the mapper may be slower, dropping the copy to HDFS
>> could well make up the difference. Assume that after the job runs the
>> data will be deleted from HDFS - the NFS system is the primary source and
>> that cannot change. Also, the job is not I/O limited - there is
>> significant computation at each step.
>>
>> My questions are:
>> 1) Are my assumptions correct - might skipping the copy save time?
>> 2) Would 200 Hadoop jobs overwhelm a medium-sized NFS system?
>>

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell) Skype lordjoe_com
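(For reference, the staging route David describes - copy into HDFS, run the job chain, then delete the staged copy since NFS remains the primary source - comes down to roughly the following sketch against the Hadoop FileSystem API. The paths are placeholders, and "hadoop fs -put" from the shell is equivalent.)

// Sketch of the copy-to-HDFS staging step (paths are placeholders; the
// shell equivalent is "hadoop fs -put"). HDFS then splits and replicates
// the blocks across the cluster before the job chain runs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf); // default FS, i.e. HDFS on a cluster

        Path nfsSource = new Path("file:///mnt/nfs/data/input"); // NFS mount (placeholder)
        Path staging = new Path("/user/steve/staging/input");    // HDFS target (placeholder)

        // Copy from the NFS mount into HDFS without deleting the source.
        hdfs.copyFromLocalFile(false /* delSrc */, true /* overwrite */, nfsSource, staging);

        // ... run the chain of five jobs against 'staging' here ...

        // NFS stays the primary copy, so the staged data can be removed afterwards.
        hdfs.delete(staging, true);
    }
}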