Hi,

Why would 3x data seem wasteful? This is exactly what you want. I would never store any serious business data without some form of replication. What happens if you store a single file on a single server without replicas and that server goes down, or just the disk that the file is on fails? HDFS, like any decent distributed file system, uses replication to prevent data loss. As a side effect, having replicas of the same piece of data on separate servers means that more than one task can work on that data in parallel.
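
To make the replication argument concrete, here is a minimal Python sketch. It is a toy model only: the round-robin placement and the names `place_blocks` and `lost_blocks` are made up for illustration and are not how HDFS actually places replicas. It only shows that with a single copy any node failure loses data, while with three copies on distinct nodes a single failure loses nothing.

```python
def place_blocks(num_blocks, num_nodes, replication):
    """Toy round-robin placement (an assumption, not the HDFS policy).

    Block b is stored on `replication` distinct nodes:
    b, b+1, ..., b+replication-1 (modulo num_nodes).
    """
    if replication > num_nodes:
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: {(b + k) % num_nodes for k in range(replication)}
        for b in range(num_blocks)
    }


def lost_blocks(placement, failed_nodes):
    """Return the blocks whose every replica sits on a failed node."""
    failed = set(failed_nodes)
    return sorted(b for b, nodes in placement.items() if nodes <= failed)


if __name__ == "__main__":
    single = place_blocks(num_blocks=10, num_nodes=5, replication=1)
    triple = place_blocks(num_blocks=10, num_nodes=5, replication=3)

    # One node dies: unreplicated data on it is gone, replicated data is not.
    print("1x, node 0 fails:", lost_blocks(single, [0]))
    print("3x, node 0 fails:", lost_blocks(triple, [0]))
```

The price is the obvious one: three copies need three times the raw disk, which is the trade-off being questioned in this thread.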

I do agree that balancing data between disks on the same node is currently lacking. Still, filling up a single disk in HDFS and then adding a new one is not that big a problem: each DataNode will save new blocks on the new disk rather than keep filling the already full one. For data durability this is also not a problem, because with 3x replication you will have one replica on the local node, one in the same rack, and another off rack.

With hadoop fs -ls <file> you get a listing of files. Yes, it's lacking, but I've found that if you use awk you can get the columns you want. It's a bit tedious, I know.

Cheers,
Gerrit

On Tue, Jan 25, 2011 at 10:05 PM, Scott Golby <sgo...@conductor.com> wrote:
> Hi Nathan,
>
> > I have a very general question on the usefulness of HDFS for purposes
> > other than running distributed compute jobs for Hadoop. Hadoop and HDFS
> > seem very popular these days, but the use of HDFS for other purposes
> > (database backend, records archiving, etc) confuses me, since there are
> > other free distributed filesystems out there (I personally work on
> > Lustre), with significantly better general-purpose performance.
>
> Coming at it from a SysAdmin point of view, I could say these things
> about HDFS:
>
> - It’s very easy to set up, basically 3 steps: Build Box, Format
>   NameNode, Start. It’s easy to be up and running in Production mode
>   HDFS in <4 hours with a 5+ machine cluster. (How big that “+” is
>   depends on your node deployment method, Kickstart, VMware, EC2
>   images, more than on the complexity of HDFS setup.)
> - At least in Hadoop 0.20 it lacks a lot of “Ops” features, such as
>   getting a simple “ls” through an API/hadoop, as you mention.
> - The 3x data seems very wasteful to me.
> - Multidisk deployments are difficult for monitoring to deal with.
>   Adding a disk is easy: one XML file and a “,/new/disk_path”. But the
>   1st disk will go to 100% and stay there, while the new disk is
>   perhaps 10% full (if you added the new disk when the old one was 90%
>   full). They fill at equal rates. There is no easy way to balance the
>   disks within the machine (the balance command works only across
>   machines), so your Ops guys will get the “Disk 100% full” page 24x7
>   until they kill someone. :)
>
> I haven’t done any benchmarking, but my expectations for speed aren’t
> high with HDFS.
>
> Scott
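
On the "ls" point raised in this thread: the awk trick amounts to picking whitespace-separated columns out of the listing. The same idea can be sketched in Python. This is only a sketch under assumptions: it assumes the usual 8-column layout of the listing (permissions, replication, owner, group, size, date, time, path), the sample text is made up, and `LsEntry` and `parse_ls_line` are hypothetical names, not part of any Hadoop API.

```python
from typing import NamedTuple, Optional


class LsEntry(NamedTuple):
    permissions: str
    replication: str  # "-" for directories in the assumed layout
    owner: str
    group: str
    size: int
    date: str
    time: str
    path: str


def parse_ls_line(line: str) -> Optional[LsEntry]:
    """Parse one listing line; return None for headers or unknown lines."""
    # Split into at most 8 fields so a path containing spaces stays whole.
    fields = line.strip().split(None, 7)
    if len(fields) != 8 or not fields[4].isdigit():
        return None  # e.g. the "Found N items" header
    perms, repl, owner, group, size, date, time, path = fields
    return LsEntry(perms, repl, owner, group, int(size), date, time, path)


# Made-up sample in the assumed layout, for illustration only.
SAMPLE = """Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2011-01-25 21:40 /user/hadoop/logs
-rw-r--r--   3 hadoop supergroup    1048576 2011-01-25 22:05 /user/hadoop/data.txt
"""

if __name__ == "__main__":
    for raw in SAMPLE.splitlines():
        entry = parse_ls_line(raw)
        if entry is not None:
            # Print just the columns of interest, like awk '{print $5, $8}'.
            print(entry.size, entry.path)
```

In practice the listing text would come from running the hadoop command and capturing its output; the parsing step stays the same.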