Hi.

 

I'm having some performance issues when working with a huge number (hundreds
of millions) of files. While map/reduce works reasonably well, filesystem
performance leaves much to be desired, mainly because each file is stored in
its own block, which puts an enormous load on the namenode. My current
solution is to pack the files into 40-50 MB tar.gz archives and download them
locally (I use streaming) before processing. This works reasonably well, and
I get exactly one block per archive.
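
In case it helps to see what I mean, here is a rough sketch of the packing
step (the paths, the ~48 MB target, and the chunk naming are just
illustrative, not my exact script):

    import os
    import tarfile

    # Aim for archives that fit in a single HDFS block.
    # The 48 MB figure is illustrative; I pack to roughly 40-50 MB.
    TARGET_SIZE = 48 * 1024 * 1024

    def pack(input_dir, output_dir):
        os.makedirs(output_dir, exist_ok=True)
        archive_idx, current_size, tar = 0, 0, None
        for root, _, names in os.walk(input_dir):
            for name in names:
                path = os.path.join(root, name)
                size = os.path.getsize(path)
                # Start a new archive once the current one would exceed the
                # target (sizes are uncompressed, so this is approximate).
                if tar is None or current_size + size > TARGET_SIZE:
                    if tar is not None:
                        tar.close()
                    archive_idx += 1
                    tar = tarfile.open(
                        os.path.join(output_dir,
                                     "chunk-%05d.tar.gz" % archive_idx),
                        "w:gz")
                    current_size = 0
                tar.add(path, arcname=os.path.relpath(path, input_dir))
                current_size += size
        if tar is not None:
            tar.close()

    if __name__ == "__main__":
        pack("small-files", "packed")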

 

Is there any standard way to do this? That is, to tell Hadoop to pack small
files together into these "chunks" and process them accordingly?

 

Dmitry
