Re: HDFS without Hadoop: Why?

Gerrit Jansen van Vuuren Tue, 25 Jan 2011 15:56:32 -0800

Hi,

Why would 3x data seem wasteful?
This is exactly what you want.  I would never store any serious business
data without some form of replication.
What happens if you store a single file on a single server without replicas
and that server goes, or just the disk that the file is on goes? HDFS,
like any decent distributed file system, uses replication to prevent data
loss. As a side effect, having replicas of the same piece of data on separate
servers means that more than one task can work on that data in parallel.


I do agree that balancing data between disks on the same node is lacking
currently. Still, filling up a single disk in HDFS and adding a new one is
not that big a problem. Each DataNode will save blocks on the new disk and
not just keep on filling the already full disk. For data durability this is
also not a problem, because with 3x replication you will have 1 replica on
the local node, 1 in the same rack, plus another off rack.
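
To make the disk-filling behaviour concrete, here is a toy model in Python. It is only a sketch, not the DataNode's actual code: it assumes blocks are handed to the configured disks in round-robin order and that any disk without room for the block is skipped. The names choose_volume and place_blocks and the numbers are made up for illustration.

```python
def choose_volume(free_bytes, start, block_size):
    """Pick the next disk with room, scanning round-robin from `start`.

    Returns (disk_index, next_start). Assumed policy, for illustration only.
    """
    n = len(free_bytes)
    for step in range(n):
        i = (start + step) % n
        if free_bytes[i] >= block_size:
            return i, (i + 1) % n
    raise RuntimeError("no disk has room for the block")


def place_blocks(free_bytes, block_size, count):
    """Place `count` blocks and return how many landed on each disk."""
    free = list(free_bytes)          # copy so the caller's list is untouched
    placed = [0] * len(free)
    cursor = 0
    for _ in range(count):
        i, cursor = choose_volume(free, cursor, block_size)
        free[i] -= block_size
        placed[i] += 1
    return placed


# Disk 0 is nearly full (room for 2 blocks), disk 1 is new (room for 10).
print(place_blocks([2, 10], block_size=1, count=6))
```

Under these assumptions the two disks take blocks at the same rate until the old one is full, after which everything goes to the new one. That matches both observations in this thread: the old disk sits at 100%, but writes do not fail because of it.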

With hadoop fs -ls <file> you get a listing of files. Yes, it's lacking, but
I've found that if you use awk you can get the columns you want. It's a bit
tedious, I know.
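
For anyone who would rather script it than use awk, the same column picking can be sketched in Python. This assumes the usual eight-column listing (permissions, replication, owner, group, size, date, time, path) with a "Found N items" header line; the sample paths and sizes below are made up, and parse_ls is a hypothetical helper, not part of Hadoop.

```python
# Sample text in the shape printed by `hadoop fs -ls` (made-up entries).
SAMPLE = """\
Found 2 items
-rw-r--r--   3 hdfs supergroup    1048576 2011-01-25 15:56 /data/part-00000
drwxr-xr-x   - hdfs supergroup          0 2011-01-25 15:50 /data/logs
"""


def parse_ls(text):
    """Return (path, size_bytes, replication) tuples from listing text."""
    entries = []
    for line in text.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # skips the "Found N items" header and blank lines
        perms, repl, owner, group, size, date, time, path = fields
        # Directories show "-" instead of a replication factor.
        replication = None if repl == "-" else int(repl)
        entries.append((path, int(size), replication))
    return entries


for path, size, replication in parse_ls(SAMPLE):
    print(path, size, replication)
```

In real use the text would come from running the hadoop command (for example through subprocess) rather than from a hard-coded sample.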

Cheers,
 Gerrit


On Tue, Jan 25, 2011 at 10:05 PM, Scott Golby <sgo...@conductor.com> wrote:

> Hi Nathan,
>
>
>
> > I have a very general question on the usefulness of HDFS for purposes
> > other than running distributed compute jobs for Hadoop.  Hadoop and HDFS
> > seem very popular these days, but the use of
> > HDFS for other purposes (database backend, records archiving, etc)
> > confuses me, since there are other free distributed filesystems out there (I
> > personally work on Lustre), with significantly better general-purpose
> > performance.
>
>
>
> Coming at it from a SysAdmin point of view I could say these things about HDFS:
>
>
>
> - It’s very easy to set up, basically 3 steps: Build Box, Format
> NameNode, Start.  It’s easy to be up and running in Production mode HDFS in
> <4 hours with a 5+ machine cluster. (How big that “+” is depends more on your
> node deployment method, Kickstart, VMware, EC2 images, than on the
> complexity of HDFS setup.)
>
> - At least in Hadoop 0.20 it lacks a lot of “Ops” features, such as
> getting a simple “ls” through an API/hadoop, as you mention.
>
> - The 3x Data seems very wasteful to me.
>
> - Multidisk deployments are difficult for monitoring to deal
> with.  Adding a disk is easy, one xml file and a “,/new/disk_path”.   But
> the 1st disk will go to 100% and stay there, while the new disk is perhaps 10%
> full (if you added the new disk when the old one was 90% full).   They fill
> at equal rates.   There is no easy way to balance the disks within the
> machine (the balance command works only across machines), so your Ops guys
> will get the “Disk 100% full” page 24x7 until they kill someone. :)
>
>
>
> I haven’t done any benchmarking, but my expectations for speed aren’t high
> with HDFS.
>
>
>
> Scott
>
>
>
