Hi All,

I am a complete newbie to Hadoop. I haven't installed or tested it yet,
but I've been reading up in my spare time for about a month and
following the list. I think it's really exciting to provide this kind
of infrastructure as open source!

I'll provide some context for the subject of this email. Although I've
seen a thread or two about storing many small files in Hadoop, I'm not
sure they address the following.

Goals:

   1. Many small files (roughly 1 MB to 2 GB each)
   2. Automated "fail-safe" redundancy
   3. Automated synchronization of that redundancy
   4. Predictable read/write speed for these files (in part or in
      whole) as load / server count increases

The middleware with access to the files could be used, among other
things, to:

   1. track "where the files are" and their states (see the sketch
      just after this list)
   2. sync differences
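
To make that concrete, here is a rough, untested sketch of how I
imagine the middleware asking HDFS where a file's blocks actually live,
via the FileSystem API. The path and class name are made up, and the
cluster config is assumed to be on the classpath:

// Untested sketch: ask the namenode which datanodes hold each block of a file.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.bin");   // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block reports the datanodes currently holding a replica.
            System.out.println("offset " + block.getOffset()
                + ", length " + block.getLength()
                + ", hosts " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}

That host list is essentially the "where are the files" state I'd want
the middleware to track.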

My thinking is that by splitting these files into blocks spread across
a number of machines, even when the files are small, CRUD operations
will be faster than over NFS, as well as "safer". I'm also thinking
that HDFS would be cheaper than DAS and more feature-rich than NAS [1].
In addition, it wouldn't matter "where" the files live in HDFS, which
would reduce the complexity of the middleware. I've also read that DHTs
generally don't do intelligent load balancing, which should make an
HDFS-style scheme more consistent.
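
For illustration, here is a minimal, untested sketch of what I mean by
location transparency: the client only deals with paths, and HDFS
decides block placement and replication on its own. The path is a
placeholder and the replication factor of 3 is just an example setting:

// Untested sketch: write and read a file without ever naming a datanode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SimpleHdfsIo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // ask for 3 replicas per block
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/middleware/example.dat");   // hypothetical path

        // Write: the namenode chooses which datanodes hold each block.
        FSDataOutputStream out = fs.create(file, true);    // overwrite if present
        out.writeUTF("payload goes here");
        out.close();

        // Read: the client never needs to know where the blocks landed.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}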

Since Hadoop is primarily designed to move the computation to where the
data is, does it make sense to use HDFS in this way? [2]
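
On [2], this is the back-of-envelope estimate I'm working from,
assuming the often-quoted rule of thumb that each file, directory, and
block costs the namenode roughly 150 bytes of heap (the counts below
are made up, not measured):

// Untested sketch: rough namenode heap estimate for a large namespace,
// assuming ~150 bytes of heap per file/directory/block (rule of thumb only).
public class NamenodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 10000000L;       // hypothetical: 10 million files
        long blocksPerFile = 2;       // e.g. ~100 MB files with 64 MB blocks
        long bytesPerObject = 150;    // assumed per-object cost in namenode heap

        long objects = files + files * blocksPerFile;
        double heapGb = objects * bytesPerObject / (1024.0 * 1024 * 1024);
        // ~4.2 GB for the numbers above.
        System.out.printf("~%.1f GB of namenode heap for %d files%n",
            heapGb, files);
    }
}

So, if that assumption holds, tens of millions of small files already
want several GB of namenode heap.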

- Jonathan

[1] - http://en.wikipedia.org/wiki/Network-attached_storage#Drawbacks
[2] - (assuming the memory limit in the master, i.e. the namenode,
isn't reached because of a large number of files/blocks)
