The short version is that the in-memory structures used by the NameNode are "heavy" on a per-file basis and light on a per-block basis, so memory consumption scales with the number of files rather than with the total bytes stored. Petabytes of files that are only a few hundred KB each will therefore require the NameNode to hold a huge amount of filesystem metadata in memory, more than you would reasonably want to pay for in a single server.
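To put a rough number on it: a commonly cited rule of thumb (an approximation, not a figure from this thread) is that each file and each block costs on the order of 150 bytes of NameNode heap, so the heap needed grows with the object count. A minimal back-of-the-envelope sketch, using a purely hypothetical file count:

    // Back-of-the-envelope estimate of NameNode heap for many small files.
    // The ~150 bytes/object figure and the file count are illustrative
    // assumptions, not measurements.
    public class NameNodeHeapEstimate {
        public static void main(String[] args) {
            long bytesPerObject = 150L;          // assumed heap cost per file or block object
            long fileCount = 1_000_000_000L;     // e.g. one billion ~300 KB images (hypothetical)
            long blocksPerFile = 1L;             // each small file fits in a single HDFS block
            long heapBytes = fileCount * (1 + blocksPerFile) * bytesPerObject;
            System.out.printf("Estimated NameNode heap: ~%.0f GB%n",
                    heapBytes / (1024.0 * 1024 * 1024));
        }
    }

With those assumptions the estimate works out to roughly 280 GB of heap just for metadata, which is the kind of number that makes a single NameNode impractical for this workload.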
There are more details about this issue in an older post on our blog:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/

- Aaron

On Mon, Mar 30, 2009 at 12:46 AM, deepya <[email protected]> wrote:
>
> Hi,
> Thanks.
>
> Can you please specify in detail what kind of problems I will face if I
> use Hadoop for this project.
>
> SreeDeepya
>
>
> TimRobertson100 wrote:
> >
> > I believe Hadoop is not best suited to many small files like yours, but
> > is really geared to handling very large files that get split into
> > smaller chunks (like 128 MB blocks), and HDFS is designed with this in
> > mind. Therefore I could *imagine* that there are other distributed
> > file systems that would far outperform HDFS if they were designed to
> > replicate and track small files without any of the *splitting* and
> > *merging* which Hadoop provides.
> >
> > Having not used MogileFS I can't really advise well, but a quick read
> > through does make it look like a candidate for you to consider - it
> > distributes files across machines and tracks replicas like HDFS without
> > the splitting, and offers access over HTTP to the individual files,
> > which I could imagine would be ideal for pulling back small images.
> >
> > Please don't just follow my advice though - I am still a relative
> > newbie to DFSs in general.
> >
> > Cheers
> >
> > Tim
> >
> >
> > On Sun, Mar 29, 2009 at 12:51 PM, deepya <[email protected]> wrote:
> >>
> >> Hi,
> >> I am doing a project on a scalable storage server to store images. Can
> >> Hadoop efficiently support this purpose? Our images are around 250 to
> >> 300 KB each, but we have many such images; the total storage may run up
> >> to petabytes in the future (at present it is in gigabytes).
> >> We want to access these images via an Apache server. Is there any
> >> mechanism by which we can talk to HDFS directly from an Apache server?
> >>
> >> I went through one of the posts here and got to know that rather than
> >> using FUSE it is better to use the HDFS API. That is fine. But they
> >> also mentioned that MogileFS would be more appropriate.
> >>
> >> Can someone please clarify why MogileFS is more appropriate? Can't
> >> Hadoop be used? How is MogileFS more advantageous? Can you suggest
> >> which file system would be more appropriate for the project I am doing
> >> at present?
> >>
> >> Thanks in advance
> >>
> >> SreeDeepya
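For reference on the HDFS API route mentioned in the original question above: reading a file back out of HDFS from Java looks roughly like the sketch below. This is only an illustrative outline, not a recommended design - the namenode host, port, and file path are placeholders, and a real image server would add caching and error handling.

    import java.io.OutputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsImageReader {
        // Streams one file from HDFS to the given OutputStream (for example,
        // the response stream of a servlet sitting behind Apache).
        public static void copyToStream(String hdfsPath, OutputStream out) throws Exception {
            Configuration conf = new Configuration();
            // "namenode-host:9000" is a placeholder for your cluster's fs.default.name.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000/"), conf);
            FSDataInputStream in = fs.open(new Path(hdfsPath));
            try {
                IOUtils.copyBytes(in, out, 4096, false); // 4 KB buffer; caller keeps streams open
            } finally {
                in.close();
            }
        }
    }

Whether HDFS or something like MogileFS is the better fit still comes down to the metadata-scaling issue discussed above, not to how the files are read back out.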
