I understand, although I can write an InputFormat / splitter which starts
with a file in NFS rather than HDFS.
Also, when I count the import to HDFS as part of the processing, haven't I
funneled all of the data through a single machine just to read it?
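Roughly what I have in mind - a minimal sketch rather than a full custom
InputFormat, and the mount point /mnt/nfs/input and class names are
placeholders: assuming the NFS export is mounted at the same path on every
node, a standard FileInputFormat can simply be pointed at a file:// URI and
the HDFS copy skipped entirely.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NfsInputJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "read-from-nfs");
            job.setJarByClass(NfsInputJob.class);

            // file:// reaches the NFS mount through the local filesystem on
            // each node, so no copy into HDFS is needed - but every split is
            // read over NFS instead of from local HDFS blocks.
            FileInputFormat.addInputPath(job, new Path("file:///mnt/nfs/input"));
            job.setInputFormatClass(TextInputFormat.class);

            // Mapper/reducer omitted: the default identity mapper passes
            // TextInputFormat's (LongWritable, Text) records straight through.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path("/tmp/nfs-job-output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Every split would then be read over NFS, which is exactly the cost I am
trying to weigh against the up-front copy.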


On Mon, May 19, 2014 at 2:30 PM, David Rosenstrauch <dar...@darose.net> wrote:

> The reason why you want to copy to hdfs first is that hdfs splits the data
> and distributes it across the nodes in the cluster.  So if your input data
> is large, you'll get much better efficiency/speed in processing it if
> you're processing it in a distributed manner.  (I.e., multiple machines
> each processing a piece of it - multiple mappers.) I'd think that keeping
> the data in NFS would be quite slow.
>
> HTH,
>
> DR
>
>
> On 05/15/2014 04:45 PM, Steve Lewis wrote:
>
>> I have a medium size data set in the terabytes range that currently lives
>> in the NFS file server of a medium institution. Every few months we want
>> to run a chain of five Hadoop jobs on this data.
>>     The cluster is medium sized - 40 nodes, about 200 simultaneous jobs.
>> The book says copy the data to HDFS and run the job. If I consider the
>> copy to HDFS and the first mapper as a single task, I wonder if it is not
>> just as easy to have a custom reader read from the NFS file system as a
>> local file and skip the step of copying to Hadoop.
>>     While the read to the mapper may be slower, dropping the copy to HDFS
>> could well make up the difference. Assume that after the job runs the data
>> will be deleted from HDFS - the NFS system is the primary source and that
>> cannot change. Also the job is not I/O limited - there is significant
>> computation at each step.
>>
>>      My questions are
>>    1) are my assumptions correct, and might skipping the copy save time?
>>    2) would 200 simultaneous Hadoop jobs overwhelm a medium sized NFS system?
>>
>>
>
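For comparison, the "copy to HDFS first" route David describes is only a few
lines with the FileSystem API. A rough sketch with placeholder paths: it
stages the NFS data into HDFS, leaves room for the five-job chain, and then
deletes the staged copy, since NFS stays the primary source.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StageNfsToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem hdfs = FileSystem.get(conf);

            Path nfsSource = new Path("file:///mnt/nfs/input"); // NFS mount (placeholder)
            Path staging = new Path("/staging/job-input");      // HDFS target (placeholder)

            // The copy funnels through this one machine, but once the blocks
            // are in HDFS they are split and replicated, so the mappers can
            // read them in parallel with data locality.
            hdfs.copyFromLocalFile(false /* keep source */, true /* overwrite */,
                                   nfsSource, staging);

            // ... run the chain of five jobs against 'staging' here ...

            // Drop the staged copy afterwards; NFS remains the primary source.
            hdfs.delete(staging, true /* recursive */);
        }
    }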


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
