Hi Nathan,

> I have a very general question on the usefulness of HDFS for purposes other
> than running distributed compute jobs for Hadoop. Hadoop and HDFS seem very
> popular these days, but the use of HDFS for other purposes (database
> backend, records archiving, etc) confuses me, since there are other free
> distributed filesystems out there (I personally work on Lustre), with
> significantly better general-purpose performance.

Coming at it from a SysAdmin point of view, I can say these things about HDFS:

- It's very easy to set up, basically 3 steps: build the box, format the NameNode, start. It's easy to be up and running with a production-mode HDFS in <4 hours with a 5+ machine cluster. (How big that "+" can be depends more on your node deployment method (Kickstart, VMware, EC2 images) than on the complexity of HDFS setup.)
- At least in Hadoop 0.20, it lacks a lot of "Ops" features; even a simple "ls" has to go through an API or the hadoop command, as you mention.
- The 3x data replication seems very wasteful to me.
- Multi-disk deployments are difficult for monitoring to deal with. Adding a disk is easy: one XML file and a ",/new/disk_path". But the 1st disk will go to 100% and stay there while the new disk is perhaps 10% full (if you added the new disk when the old one was 90% full), because they fill at equal rates. There is no easy way to balance the disks within a machine (the balance command works only across machines), so your Ops guys will get the "Disk 100% full" page 24x7 until they kill someone. :)

I haven't done any benchmarking, but my expectations for speed aren't high with HDFS.

Scott
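
P.S. To make the disk-fill numbers concrete, here is a rough sketch in Python. It is not HDFS code. It only models the "they fill at equal rates" behaviour by assuming new blocks are handed to the disks in turn, and the function name and the 100-unit disk size are made up for illustration (with 100 units per disk, the numbers read directly as percentages).

```python
def fill_round_robin(capacity, used, blocks):
    """Place `blocks` unit-sized blocks, one per disk in turn, skipping full disks.

    Assumption: equal-rate filling is modelled as simple round-robin placement.
    """
    used = list(used)  # work on a copy so the caller's list is untouched
    n = len(used)
    i = 0
    # Stop when all blocks are placed or every disk is full.
    while blocks > 0 and any(u < c for u, c in zip(used, capacity)):
        if used[i] < capacity[i]:
            used[i] += 1
            blocks -= 1
        i = (i + 1) % n
    return used


# Old disk at 90%, new empty disk added, then 20 more units of data arrive.
print(fill_round_robin([100, 100], [90, 0], 20))  # [100, 10]
```

The old disk hits 100% after only 10 more units land on it, while the new one sits at 10%, which is exactly the state that keeps paging the Ops guys.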