Hello Greg,
 I've been looking into this for a bit as well. I've run across several 
successful examples, and I'm rolling out a cluster now. 
 You sound like you've done more research than I have, but I've run across two 
caveats so far:

- Definitely do a staged rollout, testing nodes coming up and down along the 
way. HDFS can lose and add nodes on the fly, but it's not quite plug & play. 
For example, one of my first mistakes was copying the hdfs conf AND data 
directories to a new node. The data directory carries the datanode's identity, 
so the clone confused the namenode, and I ended up reformatting the cluster to 
recover. And so on.
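For what it's worth, HDFS does support cleanly retiring a node via an exclude 
file, which is safer than just killing the daemon. A sketch (the file path here 
is just an example, adjust for your install):

```xml
<!-- hdfs-site.xml on the namenode: points at a file listing hosts to retire -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```

Add the hostname to that file and run `hadoop dfsadmin -refreshNodes`; the 
namenode re-replicates the node's blocks before marking it decommissioned.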

- You did mention large files, but large is relative ;). Check your median file 
size: it should be at least a multiple of the HDFS block size (64 MB by 
default). Much smaller, and you'll hit the small-files problem (per-file 
namenode memory, poor throughput). If you do have lots of small files (I do), 
you may want to consider storing them in HBase. (I am.)
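A quick way to sanity-check that median: pull the size column out of a 
recursive listing and run it through a bit of awk. The `/data` path and the 
field position are assumptions (size is field 5 in 0.20-era `hadoop fs -lsr` 
output), so adjust for your layout:

```shell
# median of newline-separated byte counts on stdin
median() {
  sort -n | awk '{ a[NR] = $1 }
    END {
      if (NR % 2) print a[(NR + 1) / 2];
      else        print int((a[NR / 2] + a[NR / 2 + 1]) / 2)
    }'
}

# against a real cluster, feed it the size column of a recursive listing:
#   hadoop fs -lsr /data | awk '{ print $5 }' | median
printf '10\n500\n70000000\n' | median    # prints 500
```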

So far, I've been happy. Just be careful, and read the docs. If something goes 
awry with the cluster, it's hard to find another place to offload PBs of data, 
but that's true of any solution. Good luck!

Take care,
 -stu
------Original Message------
From: Greg Connor
To: hdfs-user@hadoop.apache.org
ReplyTo: hdfs-user@hadoop.apache.org
Subject: Using HDFS just for storage
Sent: Feb 24, 2010 16:50

Hello HDFS users,

We are considering using Hadoop just as a clustered storage solution,
and I'm wondering if anyone has used it like this, and might have some
experiences or wisdom to share?

We need to distribute lots of large files over 30+ machines, and HDFS
seems to have all the right features, including replication, reacting
automatically to downed nodes, etc.  From a features point of view, it
seems to be a good fit, but I really want to know if this is backed up
by any real-world experience.

First concern I have: Some of our initial throughput tests show that
transferring files into and out of HDFS is noticeably slower than just a
straight copy to the machine would be... I was hoping the throughput
would be the same, or better in cases where my hadoop client machine can
talk to many datanodes at once.  Is this lower copy throughput expected,
or is there perhaps something I've failed to tune?

My other concern would be, what would happen if we set the default
replication to 2... I know 3 is the customary setting but we really need
to keep the costs down.  Does anyone have real-world experience with
maintaining a medium-sized farm with replication set to 2?  Anything to
watch out for?

Thanks for any feedback.  You can write me directly and I'll be happy to
summarize findings back to the list if there is interest.

gregc



Sent from my Verizon Wireless BlackBerry
