> This thread has been sitting in my inbox waiting for me to have time to read
> it. I saved it specifically because I am in this boat. I appreciate all who
> have contributed on both sides.
>
> I would like to suggest that I think this would be an amazing topic for a
> PLUG meeting, or even a series of meetings.
>
> I am currently working on a (potentially very large) application project
> which is currently being hosted in "the cloud" for sake of cost, with the
> intention that in the possibly near future we would look to a more
> traditional datacenter solution. However, the concept of using a different
> mindset to make the cloud actually be the more viable solution intrigues me.
> I have heard reports of resource latency affecting large applications which
> make the cloud a potentially bad place to be. But if there are in fact ways
> to mitigate limitations that may simply be caused by using traditional
> mindsets, I want to know more.
>
> Thoughts?
>
> Jonathan
Well, let's look at a quick case study. Our first significant "cloud" based project is www.lccatalog.com. It's an ecommerce site that we decided couldn't tolerate downtime, since downtime translates quite literally into lost sales and lost reputation.

In a traditional small ecommerce app you have a webserver and a database server running on the same machine. This gives you a single point of failure: if the box fails you have no website and no DB. For a larger install you might decide to split the DB server off onto a separate physical box so that the webserver can be dedicated solely to serving up web pages. Doing that, though, increases latency, since you now need a remote DB connection, and you have introduced two distinct points of failure: your webserver could go down, or your DB server could go down. If either goes down, your customer will see it as a broken website.

To address this issue we decided not to use a DB server at all, and instead structured the ecommerce application to use AWS SimpleDB, a NoSQL schemaless DB that is essentially a large hashmap. It does, however, present itself like a traditional DB, with calls executed on the backend via HTTP GET/POST instead of a more traditional ODBC/JDBC setup. (Our app is in Java and runs on Tomcat.) We could have chosen Amazon RDS, but felt that SimpleDB was a better choice since we don't need all the features of a full relational database, such as multi-table queries, joins and stored procedures.

Going this route allowed us to get rid of the DB as a single point of failure, at the cost of having much slower DB access in some instances. We worked around that by implementing a sort of memcache system in the application: any select request is first checked against a recent-results list (in our case anything < 24hrs old is considered recent). If there is already a result in the cache we return the cached result. This is lightning fast.
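A minimal sketch of that kind of recent-results cache (class and method names here are hypothetical illustrations, not our actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a "recent results" cache: a select is answered from memory
// when a result younger than the TTL exists, otherwise the caller falls
// through to SimpleDB and stores the fresh result.
class QueryCache {
    private static final long TTL_MS = 24L * 60 * 60 * 1000; // "recent" = < 24 hours

    private static final class Entry {
        final Object result;
        final long storedAt;
        Entry(Object result, long storedAt) {
            this.result = result;
            this.storedAt = storedAt;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    /** Returns the cached result for this query, or null if absent or expired. */
    Object get(String query) {
        Entry e = cache.get(query);
        if (e == null || System.currentTimeMillis() - e.storedAt > TTL_MS) {
            cache.remove(query);   // drop expired entries lazily
            return null;           // caller must go to SimpleDB
        }
        return e.result;
    }

    /** Stores a fresh SimpleDB result under its query string. */
    void put(String query, Object result) {
        cache.put(query, new Entry(result, System.currentTimeMillis()));
    }
}
```

A real version also needs an invalidation story for writes, but the read path really is just a hashmap lookup.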
If, however, a result is not cached or has expired, then we query SimpleDB for the object. That generally has a turnaround time of less than 50ms for most result sets, though larger result sets (such as exporting the entire product DB) may take as much as 5 seconds.

Something to note about SimpleDB, and other schemaless solutions such as App Engine's Datastore, is that there is an implicit limit on result set size. If you reach that limit it returns all of the results so far, and then you have to query the API for a "next" token and resubmit until no more "next" tokens are present. This changes the way you think about queries. Also, storing multiple objects in the DB requires a batch process, which in the case of SimpleDB seems to be limited to 25 objects at a go. This made writing the product import and editing tools a rather interesting process.

Anyways, building the application this way gives us what I consider great performance, but we took it a step further. We use Amazon's CloudFront and an S3 bucket to serve all static media files. This keeps our WAR file size smaller and allows us to serve static media content such as our CSS, JavaScript, and of course images from a server sitting at an edge location close to the user making the request. Again performance is increased, because images are already physically closer to the end user, and if we need to change an image we only have to change it in the S3 bucket instead of uploading a new WAR and telling all running instances to use the new archive for image data.

Finally, we used Amazon's Elastic Beanstalk for deployment. We run a minimum of 2 instances using the "ANY2" method, which keeps our instances in distinctly different physical locations. That way if one datacenter goes belly up, we still have a server running somewhere else, and EB handles firing up additional instances during peak load times and shutting the extra instances down when the load subsides.
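The "next" token loop looks roughly like this. I've hidden the actual SimpleDB client behind a hypothetical `PagedStore` interface rather than quote exact SDK signatures; the shape of the loop is the point:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a SimpleDB-style paged query: each call
// returns one page of items plus a "next" token, or a null token when
// the result set is exhausted.
interface PagedStore {
    Page select(String query, String nextToken);
}

class Page {
    final List<String> items;
    final String nextToken; // null when there are no more pages
    Page(List<String> items, String nextToken) {
        this.items = items;
        this.nextToken = nextToken;
    }
}

class SelectAll {
    /** Resubmits the query with each "next" token until none remain. */
    static List<String> selectAll(PagedStore store, String query) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = store.select(query, token);
            all.addAll(page.items);
            token = page.nextToken;
        } while (token != null);
        return all;
    }
}
```

For something like the full product export, this loop is exactly where the multi-second turnaround comes from: one round trip per page.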
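The 25-object batch limit means an import tool has to slice its input before submitting. A generic chunking helper (again, a hypothetical sketch, not our import tool) covers the mechanical part:

```java
import java.util.ArrayList;
import java.util.List;

// SimpleDB batch puts accept at most 25 items per call, so a product
// import must be split into consecutive chunks of 25 before submission.
class BatchChunker {
    static final int BATCH_LIMIT = 25;

    /** Splits a list into consecutive sublists of at most `size` elements. */
    static <T> List<List<T>> chunk(List<T> items, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            batches.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
        }
        return batches;
    }
}
```

Each sublist then becomes one batch-put call, and a failed batch can be retried without redoing the whole import.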
It does so seamlessly; we don't have to do anything different to auto-scale.

Our total AWS infrastructure bill for the month of November was $25.00 USD. However, we are still in the free trial period, and the calculator shows that similar usage outside of the free trial will run us about $60 per month. I consider that rather reasonable for what we are getting. The trade-off is that we had to write the entire ecommerce app from scratch (although a month after we started down the "from scratch" road we learned about SimpleJPA, which would have allowed us to take any Java EE ecommerce app and simply modify it to taste, oh well).

That's the good news. The bad news is that SimpleDB defaults to "eventually consistent" reads, and as far as I can tell there is nothing in the Java API that would allow us to override that. This has directly caused a significant bug: if a customer views their cart (which is a session variable) and then moves to the checkout screen, about 1 in 10 times their checkout screen will show a $0 total. When a customer moves from the view-cart screen to the checkout screen, the session cart is inserted into the DB as a "cart" object; the checkout screen then queries the DB for a cart object tied to the customer's session ID and treats the DB cart result as canonical (because checkout may well happen on a different instance). Fortunately the consistency corrects itself in a second or two, so by the time they actually input credit card information and hit submit we do process the correct total to the credit card. Also, we have an auditor (usually the call center manager) sanity-check each transaction before sending a ship-product order to fulfillment, so thus far no one has received any unnecessarily free products, at least not to my knowledge :-) But still, bugs like this are disconcerting and will require a "rethink" in the way we handle this section of the order process.
Most times, though, the customer sees the $0 total and either hits the back button, hits refresh, or calls customer service (who tell them to hit refresh). We have also introduced a 2-second sleep into the redirect from cart to checkout, which has all but eliminated this problem. Still, it feels like a hack and I'm not happy with that. But it does work correctly 90% of the time.

In a nutshell: thinking differently from the beginning allows you to create rock-solid, stable, fast and scalable applications, but it requires a completely different mindset than what you may be used to. You have to start thinking of the entire hardware stack as a fungible asset that may be replaced at any moment with another completely fungible asset. The only guarantee you really get is that if your code runs on one box, it will run the same (or possibly better) on any of the other boxes that may replace the one you started with. You also have a situation where data may at any time be in a transient state, so a "measure twice, cut once" mindset is required when thinking about your application and its data.

Finally, for us greybeards: direct SSH administration of a box, while possible, is not advisable, since changes don't necessarily make the transition from one box to another. By using an S3-backed EBS volume and mounting it on each instance as a datastore for configuration data, you can negate that somewhat, but I haven't found a way to do this seamlessly under Elastic Beanstalk. Furthermore, I have had EBS volumes suddenly become unusable or disappear completely, leaving me to rebuild them from backups. I have also had certain individual media files disappear from S3, but at least S3 is smart enough to let me know about it automatically (via an email alert), so I can get the file back up as soon as possible. This happened to our company logo once during a peak load time, and while it was embarrassing, it wasn't fatal.
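One idea for making the fixed 2-second sleep less of a blunt instrument (purely a suggestion, not what our app does today) would be a bounded poll: re-run the cart query until a usable result comes back or a deadline passes, so fast-converging reads don't pay the full delay. Names here are hypothetical:

```java
import java.util.concurrent.Callable;

// Hypothetical bounded poll: instead of a fixed sleep before checkout,
// re-run the cart query until it returns a result or a deadline passes.
class ConsistencyRetry {
    /**
     * Calls `query` until it returns non-null, sleeping `intervalMs`
     * between attempts, for at most `timeoutMs` total.
     * Returns null on timeout so the caller can show a retry page.
     */
    static <T> T pollUntilPresent(Callable<T> query, long intervalMs, long timeoutMs)
            throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            T result = query.call();
            if (result != null) {
                return result;   // consistent read arrived
            }
            if (System.currentTimeMillis() >= deadline) {
                return null;     // give up; surface an error instead of a $0 total
            }
            Thread.sleep(intervalMs);
        }
    }
}
```

The worst case is the same as the fixed sleep, but the common case (consistency catching up in well under 2 seconds) gets through almost immediately.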
If you store all configuration in the WAR file, though, you negate most of the above issues. And S3 has been at least as reliable, if not more so, than many "server grade" drives we've used on webservers in the past.

So those are my thoughts, anyone else care to chime in?

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
