I am a complete newbie to the wonderful world of Amazon services, so I apologize if I am asking a question that has already been answered.
I am looking for the easiest way to bring up an HBase and Hadoop environment as the persistence mechanism for a Grails-based web application. I was not entirely clear which of the myriad services on offer (EC2, S3, Elastic MapReduce, etc.) provides the best approach until the previous post pointed me towards EC2 over S3. Am I correct in understanding that a farm of EC2 instances, with Hadoop and HBase installed and configured on each by myself, is the quickest and most effective way to proceed with this effort?

Jean-Daniel Cryans-2 wrote:
>
> Hi users,
>
> I've recently helped debug a 0.19 HBase setup that was using S3 as
> its DFS (one of the problems is discussed in another thread), and I
> think I've gathered enough information to guide new users on whether
> this is a worthwhile solution.
>
> Short answer: don't use it for user-facing apps; consider it for
> elastic EC2 clusters.
>
> Long answer:
>
> The main reason you would want to store your data in S3 is the
> marketed high availability and infinite scalability. As the website
> says: "It gives any developer access to the same highly scalable,
> reliable, fast, inexpensive data storage infrastructure that Amazon
> uses to run its own global network of web sites. The service aims
> to maximize benefits of scale and to pass those benefits on to
> developers." BTW, I don't refute any of this, as in my experience
> it has been mostly true.
>
> HBase can use any filesystem supported in Hadoop, including S3, so
> it seems like a no-brainer to use it instead of having to set up
> Hadoop. Yes indeed, but...
>
> - You absolutely have to deploy your region servers in EC2 because
> of the obvious latency and bandwidth cost that every filesystem
> access will incur.
> - The way the S3 code works in Hadoop, it writes every inbound and
> outbound file to local disk. Apart from slowing every operation
> down even more, if you didn't change hadoop.tmp.dir it will write
> to /tmp, and that volume on EC2 is always very, very small. In
> fact, the first thing I had to debug was a "No space left on
> device" error, which seems odd since S3 should have infinite
> storage, but the error was actually raised when data was written
> to the tmp folder.
> - There are some unknown interactions, because HBase has a very
> different file usage pattern than MapReduce jobs and was optimized
> for HDFS, not distant networked storage.
>
> So if you need speed, simply don't use S3 with HBase, as it will
> be too slow. You can consider it for elastic MapReduce jobs, the
> same way people use it with Hadoop, because you don't have to keep
> all the nodes up all the time.
>
> J-D
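
In case it is useful to anyone else reading this thread, here is a minimal sketch of what I understand the configuration difference to boil down to. The property names (hbase.rootdir, hadoop.tmp.dir) are the standard ones in this era's Hadoop/HBase configs; the hostname, port, bucket, and paths are placeholders of my own, not values from this thread:

  <!-- hbase-site.xml: run HBase against an HDFS cluster on EC2 -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode-host:9000/hbase</value>
  </property>

  <!-- hbase-site.xml: the S3-backed alternative J-D advises against -->
  <!--
  <property>
    <name>hbase.rootdir</name>
    <value>s3://my-bucket/hbase</value>
  </property>
  -->

  <!-- hadoop-site.xml: move the local S3 buffering described above off
       the small /tmp volume, e.g. onto EC2's larger /mnt ephemeral
       store -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/hadoop-tmp</value>
  </property>

If I read the post correctly, the last property matters even if you only use S3 for elastic MapReduce jobs, since the misplaced tmp directory is where the "No space left on device" error above came from.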