RE: HDFS without Hadoop: Why?

Scott Golby Tue, 25 Jan 2011 14:05:14 -0800

Hi Nathan,

> I have a very general question on the usefulness of HDFS for purposes other 
> than running distributed compute jobs for Hadoop.  Hadoop and HDFS seem very 
> popular these days, but the use of
> HDFS for other purposes (database backend, records archiving, etc) confuses 
> me, since there are other free distributed filesystems out there (I 
> personally work on Lustre), with significantly better general-purpose 
> performance.


Coming at it from a SysAdmin point of view, I can say these things about HDFS:


-          It's very easy to set up, basically 3 steps: build the box, format the 
NameNode, start.  It's easy to be up and running with HDFS in production mode in 
<4 hours with a 5+ machine cluster.  (How big that "+" can be depends more on 
your node deployment method (Kickstart, VMware, EC2 images) than on the 
complexity of HDFS setup.)

-          At least in Hadoop 0.20 it lacks a lot of "Ops" features; as you 
mention, even a simple "ls" has to go through an API or the hadoop command.

-          The 3x data replication seems very wasteful to me.

-          Multi-disk deployments are difficult for monitoring to deal with.  
Adding a disk is easy: edit one XML file and append ",/new/disk_path".  But the 
first disk will go to 100% and stay there while the new disk is perhaps 10% full 
(if you added the new disk when the old one was 90% full), because they fill at 
equal rates.  There is no easy way to balance the disks within a machine (the 
balance command works only across machines), so your Ops guys will get the 
"Disk 100% full" page 24x7 until they kill someone. :)
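
To put numbers on the multi-disk point, here is a rough Python sketch of the fill pattern. It only assumes what I described above (every disk receives the same amount of new data); the function name, the 1000 GB disk size, and the 1 GB step are made up for illustration and are not taken from HDFS itself.

```python
def fill_until_full(capacity_gb, used_gb, step_gb=1.0):
    """Write equal amounts to every disk until one of them is full.

    capacity_gb: size of each disk (hypothetical, same for all disks)
    used_gb:     starting usage per disk
    Returns the percentage used on each disk when the first one fills up.
    """
    used = list(used_gb)
    # Keep writing while every disk still has room for one more step.
    while all(u + step_gb <= capacity_gb for u in used):
        used = [u + step_gb for u in used]
    return [round(100.0 * u / capacity_gb, 1) for u in used]


# Old disk at 90% of a hypothetical 1000 GB, new disk just added and empty.
print(fill_until_full(1000.0, [900.0, 0.0]))  # [100.0, 10.0]
```

The old disk hits 100% while the new one sits at 10%, which is exactly the state that keeps the pager going off.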

I haven't done any benchmarking, but my expectations for speed aren't high with 
HDFS.

Scott
