Hi,

One benefit of the pre-processing step is that the random i/o is only paid on "new" data, i.e. it is incremental. You pay the random i/o cost once, when new data arrives. That is better than paying the random i/o cost on all the data (old and new) every time, as would be required if a map/reduce job were to run directly against the local file system.
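To make the pre-processing step concrete, here is a rough sketch of the kind of incremental batching I mean. The directory names, the marker-file trick, and the single gzip per batch are all just illustrative assumptions, not a description of any particular setup; in practice you would run several of these in parallel, as you guessed.

  #!/bin/sh
  # Rough sketch of an incremental pre-processing step (all paths are
  # made up for illustration). Only files that arrived since the last
  # run are read -- so the random i/o is paid on new data only -- and
  # they are rolled into one large compressed file that map/reduce can
  # later read sequentially out of HDFS.

  INCOMING=/data/incoming              # where the small new files land
  STAMP=/data/.last_batch              # marker recording the previous run
  BATCH=batch-`date +%Y%m%d%H%M%S`.gz

  # Pick up only the files newer than the marker from the last run.
  if [ -f "$STAMP" ]; then
      NEW_FILES=`find "$INCOMING" -type f -newer "$STAMP"`
  else
      NEW_FILES=`find "$INCOMING" -type f`
  fi

  [ -z "$NEW_FILES" ] && exit 0        # nothing new, nothing to pay for

  # Concatenate and compress the new files, then load the single large
  # file into HDFS for sequential access by the map/reduce jobs.
  cat $NEW_FILES | gzip > /tmp/"$BATCH"
  hadoop fs -put /tmp/"$BATCH" /user/hadoop/preprocessed/"$BATCH"

  touch "$STAMP"                       # remember where this batch ended
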
Thanks

mfc wrote:
>
> Hi,
>
> I can see a benefit to this approach if it replaces random
> access of a local file system with sequential access to
> large files in HDFS. We are talking about physical disks and
> seek time is expensive.
>
> But the random access of the local file system still happens,
> it just gets moved to the pre-processing step.
>
> How about walking thru the relative cost of this pre-processing step
> (which still must do random access), and some approaches to how
> this could be done. You mentioned cat | gzip (assuming parallel instances
> of this), is that what you do?
>
> Thanks

--
View this message in context: http://www.nabble.com/Using-Map-Reduce-without-HDFS--tf4331338.html#a12345124
Sent from the Hadoop Users mailing list archive at Nabble.com.
