-----Original Message-----
From: Keith Wiley [mailto:kwi...@keithwiley.com]
Sent: Tuesday, May 22, 2012 9:57 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Storing millions of small files
In addition to the responses already provided, there is another downside to using Hadoop with numerous files: it takes much longer to run a Hadoop job!

Starting a Hadoop job involves communication between the driver (which runs on a client machine outside the cluster) and the namenode to locate all of the input files. Every individual file is located with a set of RPCs between the client and the cluster, and this is done in an entirely serial fashion. In experiments we ran (and gave a talk on at the Hadoop Summit in 2010), we concluded that this overhead dominated our Hadoop jobs. By reducing the number of files (by using sequence files), we could greatly decrease the overall job time simply by reducing the overhead of locating all of the files, even though the actual MapReduce time was unaffected.

Here's a link to the slides from my talk:

http://www.slideshare.net/ydn/8-image-stackinghadoopsummit2010

Cheers!

On May 22, 2012, at 02:39, Brendan cheng wrote:

>
> Hi,
> I read the HDFS architecture doc and it said HDFS is tuned for storing large
> files, typically gigabytes to terabytes. What is the downside of storing millions
> of small files like <10MB? Or what setting of HDFS is suitable for storing
> small files?
> Actually, I plan to find a distributed file system for storing multiple millions
> of files.
> Brendan

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
                                           --  Keith Wiley
________________________________________________________________________________
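
As a rough illustration of the sequence-file idea in the message above (pack many small files into one container, so a job has one input to locate instead of thousands), here is a minimal Python sketch. It is not Hadoop's SequenceFile format or API. The record layout and the names `pack_records` and `unpack_records` are assumptions made up for this example, and it only shows the packing side of the argument, not the RPC cost itself.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator, Tuple

# Hypothetical record layout (NOT Hadoop's SequenceFile on-disk format):
#   4-byte big-endian key length, 4-byte big-endian value length,
#   then the key bytes, then the value bytes.
_HEADER = struct.Struct(">II")


def pack_records(records: Iterable[Tuple[str, bytes]], out: BinaryIO) -> int:
    """Append each (name, payload) pair to `out`; return how many were written."""
    count = 0
    for name, payload in records:
        key = name.encode("utf-8")
        out.write(_HEADER.pack(len(key), len(payload)))
        out.write(key)
        out.write(payload)
        count += 1
    return count


def unpack_records(src: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, payload) pairs from a container written by pack_records."""
    while True:
        header = src.read(_HEADER.size)
        if not header:
            return  # clean end of the container
        if len(header) < _HEADER.size:
            raise ValueError("truncated record header")
        key_len, value_len = _HEADER.unpack(header)
        key = src.read(key_len)
        value = src.read(value_len)
        if len(key) < key_len or len(value) < value_len:
            raise ValueError("truncated record body")
        yield key.decode("utf-8"), value


if __name__ == "__main__":
    # 1000 tiny in-memory "files" stand in for the small inputs.
    small_files = {f"img_{i:05d}.dat": bytes([i % 256]) * 16 for i in range(1000)}
    container = io.BytesIO()
    written = pack_records(small_files.items(), container)
    container.seek(0)
    restored = dict(unpack_records(container))
    print(written, restored == small_files)
```

With a layout like this, a reader opens one container and streams the records sequentially, which is the property the reply credits for cutting the job startup overhead. In a real cluster the container would be a Hadoop sequence file stored in HDFS rather than an in-memory buffer.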