HDFS should work pretty well for this.  You should test open times with
realistic file sets, though.  Since HDFS is so heavily optimized for
streaming reads of large files, open time has never been much of a concern.
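
If you want to put a number on that, something like the following rough,
untested sketch against libhdfs would do: it just times hdfsOpenFile /
hdfsCloseFile over a list of paths on the default filesystem. The /repo/...
paths are made up placeholders; point it at a realistic file set.

/* Rough open-latency probe using libhdfs (src/c++/libhdfs/hdfs.h).
 * The paths below are hypothetical; substitute a realistic file set
 * before drawing any conclusions. */
#include <stdio.h>
#include <fcntl.h>
#include <sys/time.h>
#include "hdfs.h"

int main(void) {
    const char *paths[] = { "/repo/file-0001", "/repo/file-0002", "/repo/file-0003" };
    const int n = sizeof(paths) / sizeof(paths[0]);

    hdfsFS fs = hdfsConnect("default", 0);   /* default fs from the config */
    if (!fs) { fprintf(stderr, "connect failed\n"); return 1; }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < n; i++) {
        /* open for read only; buffer size, replication, block size left at defaults */
        hdfsFile f = hdfsOpenFile(fs, paths[i], O_RDONLY, 0, 0, 0);
        if (f) hdfsCloseFile(fs, f);
        else   fprintf(stderr, "open failed: %s\n", paths[i]);
    }
    gettimeofday(&t1, NULL);

    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_usec - t0.tv_usec) / 1000.0;
    printf("%d opens in %.1f ms (%.2f ms/open)\n", n, ms, ms / n);

    hdfsDisconnect(fs);
    return 0;
}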

Also, I would probably leave the block size alone.  Short blocks aren't
going to be a problem, and breaking up small files isn't going to help you
since you are likely to have single readers for individual files.  The case
where it would help is when you have lots of readers reading different byte
ranges of the same file.  Map-reduce programs are that sort of thing;
checking files out now and again is not.
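
For what it's worth, if you ever do want a non-default block size, you don't
have to change it cluster-wide; libhdfs lets you pick block size and
replication per file at create time (pass 0 to keep the configured defaults).
A rough sketch, with an illustrative 1MB block size and a made-up path:

/* Sketch: block size and replication can be chosen per file at create
 * time via libhdfs; passing 0 for either keeps the configured defaults.
 * The 1MB figure is purely illustrative. */
#include <fcntl.h>
#include "hdfs.h"

void write_with_custom_blocks(hdfsFS fs, const char *path,
                              void *buf, tSize len) {
    /* bufferSize=0 (default), replication=0 (default), blocksize=1MB */
    hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY, 0, 0, 1024 * 1024);
    if (out) {
        hdfsWrite(fs, out, buf, len);
        hdfsCloseFile(fs, out);
    }
}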

MogileFS might be a good alternative as well.  It is much more oriented
toward serving lots of little files, but it would be a very bad choice for
map-reduce programs.  I have the impression that MogileFS is slightly more
amenable to high-availability operation, but HDFS is coming along pretty
quickly even if that isn't an explicit goal.
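
On the libhdfs question further down: a small C client is a reasonable way
to drive this kind of check-out/check-in traffic.  Here is a rough, untested
sketch of a whole-file "check out" to local disk; the paths are placeholders
and error handling is minimal.

/* Rough sketch: copy one HDFS file to a local path using libhdfs.
 * Both paths are hypothetical; real code needs proper error handling. */
#include <stdio.h>
#include <fcntl.h>
#include "hdfs.h"

int check_out(hdfsFS fs, const char *hdfs_path, const char *local_path) {
    hdfsFile in = hdfsOpenFile(fs, hdfs_path, O_RDONLY, 0, 0, 0);
    if (!in) return -1;

    FILE *out = fopen(local_path, "wb");
    if (!out) { hdfsCloseFile(fs, in); return -1; }

    char buf[65536];
    tSize n;
    while ((n = hdfsRead(fs, in, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, out);

    fclose(out);
    hdfsCloseFile(fs, in);
    return 0;
}

/* Usage (assumes a NameNode reachable via the default config):
 *   hdfsFS fs = hdfsConnect("default", 0);
 *   check_out(fs, "/repo/somefile", "/tmp/somefile");
 *   hdfsDisconnect(fs);
 */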

On 9/26/07 2:28 AM, "Jonathan Hendler" <[EMAIL PROTECTED]> wrote:

> Hi Dhruba, All,
> 
> Thanks for the feedback.
> It would be under 14 million files, I would expect.
> 
> The read/write question is more tricky, although it's not a read-only
> archive - it functions more as a repository where files are "checked out,"
> and when they are checked back in, files will be updated either in part or
> in whole (updating in part would be more efficient in terms of network IO,
> I assume). Files would likely be accessed 10-200 times a day, with only
> 1/10th to 1/100th of the total being accessed over the course of a day.
> 
> So, it sounds like I could change the default block size to 1MB and
> write a MapReduce job that simply reads/writes the file. I assume each
> block is replicated across a few machines.
> 
> Is there any example code you are aware of for using HDFS for this
> purpose?  [1] Or maybe HDFS isn't designed for this task.
> 
> Best,
> Jonathan
> 
> [1] http://wiki.apache.org/lucene-hadoop/LibHDFS points to your code for
> libhdfs,
> http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/c%2B%2B/libhdfs/hdfs_test.c?view=markup
> so would my simple use case be a C application, or was this code simply a
> single-machine test?
> 
> 
> Also, I'd be interested if anyone has comparative thoughts on
> http://www.danga.com/mogilefs/ ?
> 
> Dhruba Borthakur wrote:
>> Hi Jonathan,
>> 
>> Thanks for asking this question. I think all four of your requirements are
>> satisfied by HDFS. The one issue I have is that HDFS is not designed to
>> support a large number of small files, but rather a smaller number of larger
>> files. For example, the default block size is 64MB (it is configurable).
>> 
>> That said, version 0.13 is tested to store about 14 million files. Version
>> 0.15 (to be released in October) should support about 4 times that number.
>> This limit will probably increase with every passing week, but you should
>> consider this limitation when evaluating HDFS.
>> 
>> May I ask how many files you might have? How does the number of files grow
>> over time? How frequently are files accessed? Are you going to use HDFS as a
>> read-only archival system?
>> 
>> Thanks,
>> dhruba
>> 
>> -----Original Message-----
>> From: Jonathan Hendler [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, September 20, 2007 8:25 PM
>> To: [email protected]
>> Subject: HDFS instead of NFS/NAS/DAS?
>> 
>> Hi All,
>> 
>> I am a complete newbie to Hadoop, not having tested or installed it yet,
>> but I have been reading up for about a month now in my spare time and
>> following the list. I think it's really exciting to provide this kind of
>> infrastructure as open source!
>> 
>> I'll provide some context for the subject of this email; although I've
>> seen a thread or two about storing many small files in Hadoop, I'm not
>> sure they address the following.
>> 
>> Goal:
>> 
>>    1. Many small files (from 1MB-2GB)
>>    2. Automated "fail-safe" redundancy
>>    3. Automated synchronization of the redundancy
>>    4. Predictable speed as load / server count increases for read/write
>>       of these files (in part or whole)
>> 
>> The middleware having access to the files could be used, among other
>> things, to:
>> 
>>    1. track "where the files are", and their states
>>    2. sync differences
>> 
>> My thinking is that by splitting parts of these files, even if small,
>> across a number of machines, CRUD will be faster than NFS, as well as
>> "safer". Also, I'm thinking that using HDFS would be cheaper than DAS
>> and more feature-rich than NAS [1]. Also, it wouldn't matter "where" the
>> files were in HDFS, which would reduce the complexity of the
>> middleware. I also read that DHTs generally don't have intelligent load
>> balancing, making HDFS-type schemes more consistent.
>> 
>> Since Hadoop is primarily designed to move the computation to where the
>> data is, does it make sense to use HDFS in this way? [2]
>> 
>> - Jonathan
>> 
>> [1] - http://en.wikipedia.org/wiki/Network-attached_storage#Drawbacks
>> [2] - (assuming the memory limit in the master isn't reached because of a
>> large number of files/blocks)
>> 
>> 
>> 
>>   
> 
> 
