Thanks for the replies.

-----Original Message-----
From: Allen Wittenauer [mailto:awittena...@linkedin.com]
Sent: Friday, November 20, 2009 1:39 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS questions
On 11/19/09 5:44 PM, "yiqi.t...@barclayscapital.com" <yiqi.t...@barclayscapital.com> wrote:

> 1. It looks to be a user-space clustered file system, which means root
> access is never needed during installation or on-going usage, right?

It is a user-space file system; however, the user that runs the NameNode process is considered 'root' within HDFS, not UNIX-level root. It should also be pointed out that HDFS is not secure, so tricking the system into giving anyone root privileges within HDFS is beyond trivial. That makes using HDFS for SOX-regulated data a very difficult business.

> 2. It looks to have a software RAID scheme, and the default is 3 copies,
> so if I have 3 boxes with 500gb disks each, the usable space is just
> 500gb, right?

~500gb, yes, if you ignore the local disk that the MapReduce component also needs.

> 3. If I just have 1 box with a 500gb disk, what is the usable space?

It depends upon the configuration. If you run three different DataNode processes pointing at three different directories with a replication factor of 3, you get ~166gb, since every block is stored three times on the same disk. If you configure one DataNode process with a replication factor of 1, you get ~500gb. Running 'real' workloads on one machine is not recommended, however.

> 4. Does HDFS show up in the Linux OS as multiple physical files?

Yes.

> Of what size? If I change a file by 1 byte inside HDFS, I assume the
> entire physical file on the Linux OS is changed, right? This will really
> mess with tape backups...

A) HDFS is fairly WORM: there is no random IO, so you aren't going to be changing 1 byte. You will be rewriting the whole file.

B) Files at the UNIX level are written up to the block size. So if the block size is 128MB and you write a 512MB file into HDFS, you'll have 3 x 4 = 12 block files of 128MB each at the UNIX level (4 blocks, each replicated 3 times).

C) I don't think anyone does tape backups of HDFS. The expectation is either that the source of truth is elsewhere and that is what gets backed up, or that you have two or more grids holding copies of each other's data.

I'd recommend going through some of the operations presentations on http://wiki.apache.org/hadoop/HadoopPresentations . It might help answer some of the how's and why's. :)
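To make the "beyond trivial" point in the first answer concrete: a pre-security (0.20-era) client simply reports the local UNIX username to the NameNode, and nothing verifies it. A minimal sketch, assuming the NameNode was started by a local account named 'hadoop' (the account name and paths are illustrative, not from the thread):

    # HDFS trusts whatever `whoami` returns on the client box, so anyone who
    # can create or switch to an account matching the HDFS superuser becomes
    # 'root' inside HDFS:
    sudo useradd hadoop
    sudo -u hadoop hadoop fs -chown eve /protected/data   # hypothetical path
    sudo -u hadoop hadoop fs -rmr /protected/data         # or simply delete it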
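For the single-box sizing in question 3, the per-file copy count comes from the dfs.replication property in hdfs-site.xml, and files that already exist can be re-replicated from the shell. A sketch, assuming a 0.20-era install with `hadoop` on the PATH (the path is illustrative):

    # hdfs-site.xml: one copy per block, so a 500gb disk yields ~500gb usable:
    #   <property>
    #     <name>dfs.replication</name>
    #     <value>1</value>
    #   </property>
    #
    # Shrink the replication of existing files and wait for it to take effect:
    hadoop fs -setrep -w 1 /user/yiqi/data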
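The block layout described in B) is easy to see for yourself: fsck lists the blocks behind a file, and each replica is an ordinary UNIX file under the DataNode's dfs.data.dir. A sketch with illustrative names (the data directory location varies by install):

    # A 512MB file with a 128MB block size reports 4 blocks; at replication 3
    # the cluster stores 3 x 4 = 12 physical block files in total.
    hadoop fsck /user/yiqi/bigfile.dat -files -blocks -locations

    # On any one DataNode, the replicas are plain files named blk_<id>:
    ls -lh /var/hadoop/dfs/data/current/blk_*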