In addition to the responses already provided, there is another downside to
using Hadoop with numerous files: it takes much longer to run a Hadoop job!
Starting a Hadoop job consists of communication between the driver (which runs
on a client machine outside the cluster) and the namenode to locate all of the
input files.  Each individual file is located with a set of RPCs between the
client and the cluster, and this is done in an entirely serial fashion.  In
experiments we ran (and gave a talk on at the Hadoop Summit in 2010) we
concluded that this startup overhead dominated our Hadoop jobs.  By reducing
the number of files (by packing them into sequence files) we could greatly
decrease the overall job time, even though the actual MapReduce time was
unaffected, simply by reducing the overhead of locating all of the files.
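
If it helps, here is a minimal, untested sketch of that packing step using the
standard SequenceFile writer API.  The output path and the choice of keying
each record by filename are just illustrative; adapt them to your own data.

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs many small local files into one SequenceFile on HDFS so a job
// only has to locate a single input file instead of millions of them.
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);  // e.g. /data/packed.seq (illustrative)

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            // Key each record by the original filename; the value is the raw bytes.
            for (int i = 1; i < args.length; i++) {
                File f = new File(args[i]);
                byte[] bytes = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }
}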

Here's a link to the slides from my talk:
http://www.slideshare.net/ydn/8-image-stackinghadoopsummit2010

Cheers!

On May 22, 2012, at 02:39 , Brendan cheng wrote:

> 
> Hi,
> I read the HDFS architecture doc and it said HDFS is tuned for storing large
> files, typically gigabytes to terabytes. What is the downside of storing
> millions of small files, like <10MB? Or what settings of HDFS are suitable
> for storing small files?
> Actually, I plan to find a distributed file system for storing multi-millions
> of files.
> Brendan


________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
                                           --  Keith Wiley
________________________________________________________________________________
