On 11/19/09 5:44 PM, "yiqi.t...@barclayscapital.com" <yiqi.t...@barclayscapital.com> wrote:

> 1. It looks to be a user-space clustering file system, which means root
> access is never needed during installation and on-going usage, right?

It is a user-space file system. However, the user running the NameNode process is treated as 'root' within HDFS; that is not the same thing as the UNIX-level root. It should also be pointed out that HDFS is not secure, so tricking the system into giving anyone root privileges within HDFS is beyond trivial. Using HDFS for SOX data is a very difficult business.

> 2. It looks to have a software raid scheme, and the default is 3 copies,
> so if I have 3 boxes with 500gb disks each, the usable space is just
> 500gb right?

~500gb, yes, if you ignore the disk space the MapReduce component uses.

> 3. If I just have 1 box of 500gb disk, what is the usable space?

It depends upon the configuration. If you run three different DataNode processes pointing to three different directories with a replication factor of 3, then it is ~166gb. If you configure one DataNode process with a replication factor of 1, then you get ~500gb. Running 'real' workloads on one machine is not recommended, however.

> 4. Does the HDFS show up in Linux OS as multiple physical files?

Yes.

> Of what size? If I change a file of 1 byte inside HDFS, I assume the entire
> physical file on Linux OS is changed right? This will really mess with
> tape backups...

A) HDFS is fairly WORM (write once, read many): there are no random writes, so you aren't going to be changing 1 byte. You will be rewriting the whole file.

B) Files at the UNIX level are written up to the block size. So if the block size is 128MB and you write a 512MB file in HDFS, you'll have 3x4 = 12 files of 128MB each at the UNIX level (4 blocks, 3 replicas of each, assuming the default replication factor of 3).

C) I don't think anyone does tape backups of HDFS. The expectation is that the source of truth is elsewhere and that is getting backed up, or you have two+ grids that have copies of each other's data.

I'd recommend going through some of the operations presentations on http://wiki.apache.org/hadoop/HadoopPresentations . It might help answer some of the hows and whys. :)
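
For what it's worth, the capacity and block arithmetic from answers 2 through 4 can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names are made up for this example (they are not part of Hadoop), and it ignores MapReduce scratch space and any per-block overhead on disk.

```python
import math


def usable_capacity_gb(raw_gb_per_node, nodes, replication):
    """Rough usable HDFS space: total raw disk divided by the replication factor.

    Hypothetical helper for illustration only; ignores all overhead.
    """
    return raw_gb_per_node * nodes / replication


def block_files_on_disk(file_mb, block_mb, replication):
    """Rough count of block files at the UNIX level for one HDFS file.

    Blocks per file (rounded up) times the number of replicas of each block.
    """
    blocks = math.ceil(file_mb / block_mb)
    return blocks * replication


# Question 2: three boxes, 500gb each, replication factor 3
print(usable_capacity_gb(500, 3, 3))

# Question 3: one box, 500gb, replication factor 3 vs. 1
print(round(usable_capacity_gb(500, 1, 3), 1))
print(usable_capacity_gb(500, 1, 1))

# Question 4: a 512MB file with a 128MB block size and 3 replicas
print(block_files_on_disk(512, 128, 3))
```

The last call gives the 3x4 = 12 figure from answer B; a file smaller than one block still costs one block file per replica, though that file is only as large as the data written.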