> Not always being able to read back an object that has been written is
> deadly. Having the S3 client cache written data for a while can help but
> isn't a complete solution because the RS can fail and its regions will be
> reassigned to another RS... who then might not be able to read the data.
> A region might bounce around the cluster taking exceptions on open for a
> while. This availability problem could eventually stall all clients. To
> address this, you could implement a distributed write-behind cache for S3,
> but is it worth the effort and added complexity?
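That failure mode can be sketched with a toy model (everything below is illustrative Python, not actual S3 or HBase code): a freshly written key stays invisible to readers for a few attempts, and whoever opens the region has to spin until the read finally succeeds.

```python
import time

class EventuallyConsistentStore:
    """Toy stand-in for S3's old read-after-write lag: a new key is
    invisible for the first few read attempts after it is written."""
    def __init__(self, visibility_delay_reads=3):
        self._data = {}
        self._pending_reads = {}
        self.visibility_delay_reads = visibility_delay_reads

    def put(self, key, value):
        self._data[key] = value
        self._pending_reads[key] = self.visibility_delay_reads

    def get(self, key):
        if self._pending_reads.get(key, 0) > 0:
            self._pending_reads[key] -= 1
            raise FileNotFoundError(key)  # read-after-write miss
        return self._data[key]

def read_with_retry(store, key, attempts=10, backoff=0.0):
    # A region server opening a region would need a loop like this;
    # until the read succeeds, the region is stuck "opening" and its
    # clients stall.
    for i in range(attempts):
        try:
            return store.get(key)
        except FileNotFoundError:
            time.sleep(backoff * (2 ** i))
    raise TimeoutError(f"{key} never became visible")

store = EventuallyConsistentStore(visibility_delay_reads=3)
store.put("hfile-0001", b"block data")
print(read_with_retry(store, "hfile-0001"))  # succeeds on the 4th attempt
```

A client-side write cache only masks the misses for the writer itself; once the region moves to a different RS, that RS starts from attempt zero, which is why the region can bounce around taking exceptions on open.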
Argh. Eventual consistency bites. Perhaps HDFS on EBS is the only viable solution after all. The trouble is cost: S3 is 14 cents per GB-month with full redundancy (whatever that means), whereas EBS is 10 cents per GB-month. EBS's redundancy may not really be adequate, so you probably need 2 or 3 HDFS block replicas, which puts EBS storage at 20 or 30 cents per GB-month, depending on your pain threshold. I am most interested in running HBase well in the cloud, i.e. EC2 and other OpenStack-based IaaSes.

Thanks for sharing your insights, Andrew.

Jagane
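P.S. The replica-cost arithmetic spelled out, using the per-GB-month prices quoted above (a back-of-envelope sketch, not a full cost model: it ignores request charges, IOPS, and instance storage):

```python
# Storage cost per GB-month, in cents, at the prices quoted in the thread:
# S3 at 14 c/GB-month with built-in redundancy, EBS at 10 c/GB-month per
# volume, with HDFS replication multiplying the EBS footprint.
S3_CENTS_PER_GB_MONTH = 14
EBS_CENTS_PER_GB_MONTH = 10

def hdfs_on_ebs_cents(replication_factor):
    # HDFS stores `replication_factor` full copies, each on its own EBS volume.
    return EBS_CENTS_PER_GB_MONTH * replication_factor

for rf in (2, 3):
    cost = hdfs_on_ebs_cents(rf)
    relation = "more" if cost > S3_CENTS_PER_GB_MONTH else "less"
    print(f"replication {rf}: {cost} c/GB-month "
          f"({relation} than S3's {S3_CENTS_PER_GB_MONTH})")
```

So even at replication 2, HDFS on EBS already costs more per stored GB than S3; the question is whether S3's consistency problems are worth the 6-16 cent saving.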
