Hello Greg, I've been looking into this for a bit as well. I've run across several successful examples, and I'm rolling out a cluster now. You sound like you have done more research than I have, but I have run across two caveats so far:
- Definitely do a staged rollout, testing nodes coming up and down along the way. HDFS can lose and add nodes on the fly, but it's not quite plug and play. For example, one of my first mistakes was copying the HDFS conf AND data directories to a new node. This confused the namenode, and I ended up reformatting the cluster to recover. And so on.

- You did mention large files, but large is relative ;). Check your median file size: it should be some multiple of the block size (64 MB by default). Any smaller, and you'll have issues. If you do have lots of small files (I do), you may want to consider storing them in HBase. (I am.)

So far, I've been happy. Just be careful, and read the docs. If something goes awry with the cluster, it's hard to find another place to offload PBs of data, but this is true for any solution. Good luck!

Take care,
-stu

------Original Message------
From: Greg Connor
To: hdfs-user@hadoop.apache.org
ReplyTo: hdfs-user@hadoop.apache.org
Subject: Using HDFS just for storage
Sent: Feb 24, 2010 16:50

Hello HDFS users,

We are considering using Hadoop just as a clustered storage solution, and I'm wondering if anyone has used it like this and might have some experiences or wisdom to share. We need to distribute lots of large files over 30+ machines, and HDFS seems to have all the right features, including replication, reacting automatically to downed nodes, etc. From a features point of view it seems to be a good fit, but I really want to know whether this is backed up by real-world experience.

My first concern: some of our initial throughput tests show that transferring files into and out of HDFS is noticeably slower than a straight copy to the machine would be. I was hoping the throughput would be the same, or better in cases where my Hadoop client machine can talk to many datanodes at once. Is this lower copy throughput expected, or is there perhaps something I've failed to tune?
My other concern: what would happen if we set the default replication to 2? I know 3 is the customary setting, but we really need to keep costs down. Does anyone have real-world experience maintaining a medium-sized farm with replication set to 2? Anything to watch out for?

Thanks for any feedback. You can write me directly, and I'll be happy to summarize findings back to the list if there is interest.

gregc
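To put numbers on Stu's file-size caveat, you can compute the median file size straight from a recursive listing. A minimal sketch, assuming the usual `hadoop fs -ls -R` output format where lines for directories start with 'd' and column 5 is the size in bytes (`/data` is a placeholder path):

```shell
# median_file_size: read `hadoop fs -ls -R`-style lines on stdin and
# print the median of column 5 (size in bytes), skipping directories.
median_file_size() {
  awk '$1 !~ /^d/ {print $5}' \
    | sort -n \
    | awk '{ s[NR] = $1 }
           END {
             if (NR == 0) { print 0; exit }
             mid = int((NR + 1) / 2)
             if (NR % 2) print s[mid]          # odd count: middle element
             else printf "%d\n", (s[mid] + s[mid + 1]) / 2
           }'
}

# Typical use (requires a running cluster):
#   hadoop fs -ls -R /data | median_file_size
```

Compare the result against the block size (64 MB = 67108864 bytes); a median well below that is the small-files warning sign Stu mentions.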
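On Greg's replication question: the cluster-wide default lives in `hdfs-site.xml`, and it only affects files written after the change; existing files keep the factor they were written with and have to be changed explicitly. A sketch of both steps (property and command names per Hadoop's standard config and FsShell docs; run against your own cluster at your own risk):

```shell
# 1. Default for newly written files: add to conf/hdfs-site.xml
#    on the client machines (the writer's setting is what counts):
#
#      <property>
#        <name>dfs.replication</name>
#        <value>2</value>
#      </property>
#
# 2. Existing files keep their old factor; lower them recursively with:
hadoop fs -setrep -R 2 /

# Add -w to block until re-replication (here, block removal) completes:
#   hadoop fs -setrep -R -w 2 /
```

With replication 2, note that a single failed datanode leaves some blocks with only one live copy until re-replication finishes, so the window where a second failure loses data is correspondingly wider than with the customary 3.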