Hi,

One benefit of the pre-processing step is that its random i/o is only
done on "new" data, i.e. it is incremental. You pay the random i/o cost
once, when the new data is added. That is better than paying the random
i/o cost on all of the data (old and new) every time, as would be
required if a map/reduce job were run directly on the local file system.
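
To make that concrete, here is a rough sketch (just a sketch, not our
actual code) of one way the incremental packing could be done in plain
Java, in the spirit of the cat | gzip approach mentioned below. The
directory layout, marker file, and class name are only placeholders,
and it assumes line-oriented, newline-terminated records where file
boundaries don't matter.

import java.io.*;
import java.util.zip.GZIPOutputStream;

public class IncrementalPack {
    public static void main(String[] args) throws IOException {
        File inputDir = new File(args[0]); // directory holding the small "new" files
        File marker   = new File(args[1]); // empty file whose mtime records the last run
        File output   = new File(args[2]); // one large sequential pack file, e.g. pack-0007.gz

        long lastRun = marker.exists() ? marker.lastModified() : 0L;

        File[] files = inputDir.listFiles();
        if (files == null) return;         // nothing to do if the directory is missing

        OutputStream out = new GZIPOutputStream(
                new BufferedOutputStream(new FileOutputStream(output)));
        try {
            for (File f : files) {
                // only touch files added since the last run -- old data was packed already
                if (!f.isFile() || f.lastModified() <= lastRun) continue;

                InputStream in = new BufferedInputStream(new FileInputStream(f));
                try {
                    byte[] buf = new byte[64 * 1024];
                    int n;
                    while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
                } finally {
                    in.close();
                }
            }
        } finally {
            out.close();
        }

        // remember when this run happened so the next run skips what we just packed
        marker.createNewFile();
        marker.setLastModified(System.currentTimeMillis());
    }
}

Each run reads the new small files once (that is the only random i/o)
and writes one large file that map/reduce can then scan sequentially.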

Thanks


mfc wrote:
> 
> Hi,
> 
> I can see a benefit to this approach if it replaces random
> access of a local file system with sequential access to 
> large files in HDFS. We are talking about physical disks and
> seek time is expensive.
> 
> But the random access of the local file system still happens, 
> it just gets moved to the pre-processing step.
> 
> How about walking thru the relative cost of this pre-processing step
> (which still must do random access), and some approaches to how
> this could be done. You mentioned cat | gzip (assuming parallel instances
> of this), is that what you do?
> 
> Thanks
> 
> 

