Hi, I'm assuming that the input to a Hadoop job is a large set of large ASCII files that you run a map/reduce job on.
If I'm starting with a large number of small ASCII files outside of HDFS, where/when does the conversion to large files take place? You seem to be recommending a pre-step (is that correct?) that first cats and gzips the small files into big files. Once that is done, you copy the big files into HDFS and run a map/reduce job on them ...but then the map/reduce job in Hadoop breaks the large files back down into small chunks. That is what prompted my original question about running map/reduce directly on the small files in the local file system.

I'm wondering whether converting to large files and copying them into HDFS would introduce a lot of overhead that wouldn't be necessary if map/reduce could be run directly against the small files on the local file system.

I'd be interested to know whether this is an appropriate use of Hadoop. My knowledge of Hadoop is limited, and I'm just trying to learn where/how it can be used.

Thanks
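P.S. To make the "skip HDFS" idea concrete, here is a rough sketch of the kind of job driver I imagine, using the old org.apache.hadoop.mapred API with identity map/reduce. The config keys, class names, and local paths are my own assumptions for illustration; I haven't verified them against a particular Hadoop release.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class LocalFsJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LocalFsJob.class);
    conf.setJobName("local-fs-sketch");

    // Point the job at the local filesystem and the local (in-process)
    // job runner instead of HDFS and a JobTracker cluster.
    conf.set("fs.default.name", "file:///");
    conf.set("mapred.job.tracker", "local");

    // Identity map/reduce just to keep the sketch self-contained;
    // a real job would plug in its own Mapper/Reducer here.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // Hypothetical local paths: the directory full of small ASCII files
    // and a local output directory.
    FileInputFormat.setInputPaths(conf, new Path("file:///data/small-files"));
    FileOutputFormat.setOutputPath(conf, new Path("file:///data/output"));

    JobClient.runJob(conf);
  }
}

If Hadoop's standalone/local mode with file:/// paths amounts to roughly this, then my question is really whether the many-small-files overhead is any better there than after the cat/gzip/copy-into-HDFS pre-step.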
