Hi Dhruba, All,

Thanks for the feedback. It would be under 14 million files, I would expect.
The read/write question is trickier. It's not a read-only archive - it
functions more like a repository where files are "checked out" and, when they
are checked back in, are updated either in part or in whole. (Updating in
part would be more efficient in terms of network IO, I assume.) Files would
likely be accessed 10-200 times a day, with only 1/10th to 1/100th of the
total being accessed over the course of a day.

So, it sounds like I could change the default block size to 1MB and write a
MapReduce job that simply reads/writes the file. I assume each block is
replicated across a few machines. Is there any example code, that you are
aware of, for using HDFS for this purpose? [1] Or maybe HDFS isn't designed
for this task.

Best,
Jonathan

[1] http://wiki.apache.org/lucene-hadoop/LibHDFS points to your code for
libhdfs,
http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/c%2B%2B/libhdfs/hdfs_test.c?view=markup,
so would my simple use case be a C application, or was this code simply a
single-machine test? (I've put a rough, untested sketch of what I have in
mind at the bottom of this message, below the quoted text.)

Also, I'd be interested if anyone has comparative thoughts on
http://www.danga.com/mogilefs/ ?

Dhruba Borthakur wrote:
> Hi Jonathan,
>
> Thanks for asking this question. I think all four of your requirements are
> satisfied by HDFS. The one issue that I have is that HDFS is not designed
> to support a large number of small files, but rather a smaller number of
> larger files. For example, the default block size is 64MB (this is
> configurable).
>
> That said, version 0.13 is tested to store about 14 million files. Version
> 0.15 (to be released in October) should support about 4 times that number.
> This limit will probably increase with every passing week, but you should
> take this limitation into account when evaluating HDFS.
>
> May I ask how many files you might have? How does the number of files grow
> over time? How frequently are files accessed? Are you going to use HDFS as
> a read-only archival system?
>
> Thanks,
> dhruba
>
> -----Original Message-----
> From: Jonathan Hendler [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 20, 2007 8:25 PM
> To: [email protected]
> Subject: HDFS instead of NFS/NAS/DAS?
>
> Hi All,
>
> I am a complete newbie to Hadoop, not having tested or installed it yet,
> but I have been reading up in my spare time for about a month now and
> following the list. I think it's really exciting to provide this kind of
> infrastructure as open source!
>
> I'll provide some context for the subject of this email; although I've
> seen a thread or two about storing many small files in Hadoop, I'm not
> sure they address the following.
>
> Goals:
>
>    1. Many small files (from 1MB-2GB)
>    2. Automated "fail-safe" redundancy
>    3. Automated synchronization of the redundancy
>    4. Predictable speed for reads/writes of these files (in part or whole)
>       as load / server count increases
>
> The middleware having access to the files could be used, among other
> things, to:
>
>    1. track "where the files are", and their states
>    2. sync differences
>
> My thinking is that by splitting these files into parts, even small ones,
> across a number of machines, CRUD will be faster than NFS, as well as
> "safer". I'm also thinking that using HDFS would be cheaper than DAS and
> more feature-rich than NAS [1]. Also, it wouldn't matter "where" the files
> were in HDFS, which would simplify the middleware. I have also read that
> DHTs generally don't have intelligent load balancing, making HDFS-type
> schemes more consistent.
>
> Since Hadoop is primarily designed to move the computation to where the
> data is, does it make sense to use HDFS in this way? [2]
>
> - Jonathan
>
> [1] - http://en.wikipedia.org/wiki/Network-attached_storage#Drawbacks
> [2] - (assuming the memory limit in the master isn't reached because of a
> large number of files/blocks)
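
P.S. To make my question about libhdfs more concrete, here is roughly the
kind of "check in / check out" client I have in mind, pieced together from
skimming hdfs_test.c. It is only a sketch: I haven't compiled or run it, the
path and the 1MB block size are placeholders, and I may well be misusing the
API - corrections welcome.

    /* check_in.c - rough sketch of a whole-file check-in and a partial
     * check-out against HDFS via libhdfs. Assumes a running cluster
     * reachable through the default Hadoop configuration; the path and
     * block size below are made up. */
    #include "hdfs.h"
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Connect to the namenode named in the Hadoop config ("default"). */
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) { fprintf(stderr, "failed to connect to HDFS\n"); return 1; }

        const char *path = "/repository/some-file.bin"; /* placeholder path */
        tSize blockSize = 1024 * 1024;          /* 1MB blocks, as discussed */

        /* "Check in": write the whole file (0s mean default buffer size and
         * default replication). */
        hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY | O_CREAT, 0, 0, blockSize);
        if (!out) { fprintf(stderr, "open for write failed\n"); return 1; }
        const char *data = "file contents would go here";
        hdfsWrite(fs, out, (void *)data, strlen(data));
        hdfsFlush(fs, out);
        hdfsCloseFile(fs, out);

        /* "Check out" part of the file: read from an arbitrary offset. */
        char buf[256];
        hdfsFile in = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
        tSize n = hdfsPread(fs, in, 5, buf, sizeof(buf) - 1);
        if (n >= 0) { buf[n] = '\0'; printf("read back: %s\n", buf); }
        hdfsCloseFile(fs, in);

        hdfsDisconnect(fs);
        return 0;
    }

For whole-file transfers I assume the shell equivalent would simply be
bin/hadoop dfs -put / -get, and the MapReduce piece would only matter for
bulk operations over many files at once - but please correct me if I've
misunderstood.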
