Hi, Andrew

> I second that, but how much you need is load dependent and
> there are no clear formulas that I am aware of for plugging
> in your estimated load and getting back a suitable
> configuration estimate. Perhaps someday. I think not enough
> operational experience is available at this stage. So for
> now it's trial and error. I would start with 4 nodes, then
> increase as necessary (by 2 or 4 datanode/regionserver
> pairs each step) to spread load if you encounter DFS
> errors or other errors related to loading. Such errors are
> pretty easy to spot: Look for errors regarding blocks not
> found, replication failures, heartbeat timeouts, lease
> expiration, and such. Generally these have as a root cause
> thread starvation from over loading. One telltale sign,
> from HBase at least, is messages of the form "We slept
> XXXXXX ms, ten times longer than expected". I give this
> advice assuming that you will be running DFS, HBase, and
> task trackers (therefore mapreduce mappers and reducers)
> concurrently side by side.
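To watch for the overload signs listed above, a rough log scan could look like the sketch below. The category names and the loose regular expressions are assumptions for illustration, not exact Hadoop or HBase log formats. Only the "We slept ... ms" wording comes from the message above, and the real wording may differ between versions.

```python
import re
from collections import Counter

# Loose, assumed patterns for the overload symptoms named in the advice above.
# They are keyword matches, not the exact messages any Hadoop/HBase version prints.
SIGNS = {
    "long_sleep": re.compile(r"We slept (\d+)\s*ms, ten times longer than expected"),
    "block_not_found": re.compile(r"block.*not found", re.I),
    "replication_failure": re.compile(r"replicat\w*.*fail", re.I),
    "heartbeat_timeout": re.compile(r"heartbeat.*tim\w*\s?out", re.I),
    "lease_expired": re.compile(r"lease.*expir", re.I),
}


def count_overload_signs(lines):
    """Count how many lines match each symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Made-up sample lines, only to show how the function is called.
sample = [
    "WARN We slept 52000 ms, ten times longer than expected",
    "INFO region opened",
    "ERROR Lease expired for client 42",
    "WARN heartbeat timed out for node-3",
]
counts = count_overload_signs(sample)
```

The function accepts any iterable of lines, so an open log file can be passed in directly. Counts that keep rising under load would be a hint to add another 2 or 4 datanode/regionserver pairs.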
Thank you for the advice. And yes, we are running DFS and HBase side by side.

> The extra large instance type is required for running HDFS
> and HBase daemons side by side. Both are heap intensive
> and require about 1G RAM per daemon just to start.

Thanks. We should probably check the Java options of the daemons to make sure they start up with 1G.

> Also, regarding EC2, it probably goes without saying, but
> DO NOT use the S3 filesystem to back your HBase tables.
> Use local HDFS + HBase on the nodes, and use Hadoop distcp
> to back up and import to/from S3 if you need it for
> persistence.

That's OK. We don't do that :)

Thank you for your cooperation,
M.
