Hi.
I'm having performance issues when working with a huge number of files (hundreds of millions). The map/reduce side works reasonably well, but filesystem performance leaves much to be desired, mainly because the files are stored one block per file, which puts an enormous load on the NameNode.

My current solution is to pack the files into 40-50 MB tar.gz archives and download them locally (I use streaming) before processing. This works reasonably well, and I get exactly one block per archive. Is there a standard way to do this, that is, to tell Hadoop to pack small files together into such "chunks" and process them accordingly?

Dmitry
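For clarity, here is a rough sketch of the packing step I mean (the paths, the chunk size, and the naming scheme are just illustrative, not my exact script): it walks a local directory and groups small files into tar.gz archives of roughly 45 MB, so each archive lands in a single HDFS block.

# Sketch of the packing step: group small files into ~45 MB tar.gz chunks
# so that each archive fits in one HDFS block (64 MB default block size).
# SRC_DIR, OUT_DIR and the size threshold are illustrative assumptions.
import os
import tarfile

SRC_DIR = "/data/small_files"      # hypothetical source directory
OUT_DIR = "/data/packed"           # hypothetical output directory
CHUNK_BYTES = 45 * 1024 * 1024     # stay under the 64 MB default block size

def pack(src_dir, out_dir, chunk_bytes):
    os.makedirs(out_dir, exist_ok=True)
    chunk_idx, chunk_size, archive = 0, 0, None
    for root, _, names in os.walk(src_dir):
        for name in names:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            # start a new archive when the current one would exceed the limit
            if archive is None or chunk_size + size > chunk_bytes:
                if archive is not None:
                    archive.close()
                chunk_idx += 1
                archive = tarfile.open(
                    os.path.join(out_dir, "chunk-%05d.tar.gz" % chunk_idx),
                    "w:gz")
                chunk_size = 0
            archive.add(path, arcname=os.path.relpath(path, src_dir))
            chunk_size += size
    if archive is not None:
        archive.close()

if __name__ == "__main__":
    pack(SRC_DIR, OUT_DIR, CHUNK_BYTES)

(The threshold is checked against the uncompressed sizes, so the resulting archives come out a bit smaller after gzip, which is fine for my purposes.)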
