Thanks for the replies.

-----Original Message-----
From: Allen Wittenauer [mailto:awittena...@linkedin.com]
Sent: Friday, November 20, 2009 1:39 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS questions
On 11/19/09 5:44 PM, "yiqi.t...@barclayscapital.com" <yiqi.t...@barclayscapital.com> wrote:

> 1. It looks to be a user-space clustered file system, which means root
> access is never needed during installation or on-going usage, right?

It is a user-space file system; however, the user that runs the NameNode process is considered 'root' within HDFS, not UNIX-level root. It should also be pointed out that HDFS is not secure, so tricking the system into giving anyone root privileges within HDFS is beyond trivial. That makes using HDFS for SOX-regulated data a very difficult business.

> 2. It looks to have a software RAID scheme, and the default is 3 copies,
> so if I have 3 boxes with 500gb disks each, the usable space is just
> 500gb, right?

~500gb, yes, if you ignore the local disk that the MapReduce component also needs.

> 3. If I just have 1 box with a 500gb disk, what is the usable space?

It depends upon the configuration. If you run three different DataNode processes pointing at three different directories with a replication factor of 3, you get ~166gb, since every block is stored three times on the same disk. If you configure one DataNode process with a replication factor of 1, you get ~500gb. Running 'real' workloads on one machine is not recommended, however.

> 4. Does HDFS show up in the Linux OS as multiple physical files?

Yes.

> Of what size? If I change a file by 1 byte inside HDFS, I assume the
> entire physical file on the Linux OS is changed, right? This will really
> mess with tape backups...

A) HDFS is fairly WORM: there is no random IO, so you aren't going to be changing 1 byte. You will be rewriting the whole file.

B) Files at the UNIX level are written up to the block size. So if the block size is 128MB and you write a 512MB file into HDFS, you'll have 3 x 4 = 12 block files of 128MB each at the UNIX level (4 blocks, each replicated 3 times).

C) I don't think anyone does tape backups of HDFS. The expectation is either that the source of truth is elsewhere and that is what gets backed up, or that you have two or more grids holding copies of each other's data.

I'd recommend going through some of the operations presentations on http://wiki.apache.org/hadoop/HadoopPresentations . It might help answer some of the how's and why's. :)
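To make the "beyond trivial" point in the first answer concrete: a pre-security (0.20-era) client simply reports the local UNIX username to the NameNode, and nothing verifies it. A minimal sketch, assuming the NameNode was started by a local account named 'hadoop' (the account name and paths are illustrative, not from the thread):

    # HDFS trusts whatever `whoami` returns on the client box, so anyone who
    # can create or switch to an account matching the HDFS superuser becomes
    # 'root' inside HDFS:
    sudo useradd hadoop
    sudo -u hadoop hadoop fs -chown eve /protected/data   # hypothetical path
    sudo -u hadoop hadoop fs -rmr /protected/data         # or simply delete it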
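For the single-box sizing in question 3, the per-file copy count comes from the dfs.replication property in hdfs-site.xml, and files that already exist can be re-replicated from the shell. A sketch, assuming a 0.20-era install with `hadoop` on the PATH (the path is illustrative):

    # hdfs-site.xml: one copy per block, so a 500gb disk yields ~500gb usable:
    #   <property>
    #     <name>dfs.replication</name>
    #     <value>1</value>
    #   </property>
    #
    # Shrink the replication of existing files and wait for it to take effect:
    hadoop fs -setrep -w 1 /user/yiqi/data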
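The block layout described in B) is easy to see for yourself: fsck lists the blocks behind a file, and each replica is an ordinary UNIX file under the DataNode's dfs.data.dir. A sketch with illustrative names (the data directory location varies by install):

    # A 512MB file with a 128MB block size reports 4 blocks; at replication 3
    # the cluster stores 3 x 4 = 12 physical block files in total.
    hadoop fsck /user/yiqi/bigfile.dat -files -blocks -locations

    # On any one DataNode, the replicas are plain files named blk_<id>:
    ls -lh /var/hadoop/dfs/data/current/blk_*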