That sounds a lot like a twelve-factor app :) http://12factor.net
-- justin

I'm not sure what news you've been reading, but...

The London airport was shut down due to a system failure and a backup system that utterly failed to do its job:
http://www.bbc.co.uk/news/uk-25281675

BART was shut down due to a computer failure:
http://blogs.kqed.org/newsfix/2013/11/22/no-bart-service-this-morning-due-to-computer-glitch/

RBS left its customers high and dry and unable to access their accounts:
http://www.channel4.com/news/rbs-already-under-investigation-over-computer-failures

Another airport shut down because of a failure:
http://www.keysnet.com/2013/12/04/492971/southwest.html

A computer failure allowed bad meat to ship, resulting in a recall:
http://www.kltv.com/story/23974187/nationwide-computer-failures-cause-millions-of-pound-of-meat-to-go-uninspected-weekly

These are all examples of notable failures in the last month. They were big enough that they made the news. None of them were "cloud" services. All of them had very significant impact.

At least when I host my systems with a provider I don't have to worry about mean time between failures and replacing systems that go bad. In fact, if you have monitoring set up correctly to watch the important metrics, then when something goes wrong on your system you just spin the old one down and spin up a new one. All of my deployments do this automatically, and I just get an email when a failure is detected and again when the replacement is done.

It's not a panacea, but what I spend in hosting costs for cloud services would easily be dwarfed by the costs of colo for my own boxes plus the time & effort spent to monitor them and replace something when it goes awry.

Your point about MySQL is valid. MySQL is not very good in a situation where the storage is remote, like on an NFS or s3fs mount. If the link goes down, MySQL will never recover without a reboot. You're generally much better off using different technologies and rethinking your application.
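
Going back to the monitoring point for a second, here is a rough sketch of what "spin the old one down, spin up a new one, get an email at detection and again at completion" could look like. It is not the actual deployment scripts. Instance, Healer, launch and notify are made-up names for illustration, and no real provider API is called.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Instance:
    """Stand-in for a cloud instance (hypothetical, no provider API behind it)."""
    name: str
    healthy: bool = True
    running: bool = True


@dataclass
class Healer:
    """Replace unhealthy instances, sending a notice before and after."""
    launch: Callable[[str], Instance]  # hypothetical: starts a replacement
    notify: Callable[[str], None]      # hypothetical: e.g. sends an email
    fleet: List[Instance] = field(default_factory=list)

    def check_once(self) -> int:
        """Run one monitoring pass; return how many instances were replaced."""
        replaced = 0
        for i, inst in enumerate(self.fleet):
            if inst.healthy:
                continue
            self.notify(f"failure detected on {inst.name}")
            inst.running = False                          # spin the old one down
            self.fleet[i] = self.launch(inst.name + "-new")  # spin up a new one
            self.notify(f"{inst.name} replaced by {self.fleet[i].name}")
            replaced += 1
        return replaced
```

In a real setup check_once would run on a timer, and the health flag would come from whatever metrics the monitoring watches.
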
In general, if I come to a point in my design where I'm looking at an RDS as the solution, I tend to wonder where I've failed in my design. For long-term storage, or anything that doesn't need high availability but does need the structure an RDS provides, sometimes there is no alternative. Usually, though, there is. In most cases it's just a matter of thinking differently about your data. If I must go with an RDS, I make sure that it has local storage, as in Amazon SimpleRDS.

One final thing to note: you should not assume that you are running on anything approaching modern hardware. An Amazon t1.micro instance has about the same specs as my 3-year-old cellphone. Something approaching modern specs is going to cost you about $0.35/hr vs. the $0.004/hr of a micro instance. On the whole, it's better to think of these instances as dedicated task processors rather than modern hardware that can run umpteen million services. You spin one up to run a specific task in a complicated workflow. If you do it correctly, you load balance that workflow across multiple parallel instances and spin them down when the job is complete.

For example, I have a customer who is a professional photographer/videographer for high-end clients (models, celebs and other people with more dollars than sense). He needs to index and process an absolutely huge number of photos and videos; there is no way he could do this by hand. I built a dedicated facial recognition & tagging system running on AWS. I based my design for this service on something similar I did for a missing kids service in China.

In all, it's made up of 1 web server, 1 database server, 1 facial recognition engine, 1 image reprocessing engine and a whole bunch of S3 storage (with automated backup to Glacier). When he uploads new photos to the site, the upload is sent direct to S3. The website sends a message to the facial recognition engine, which then begins to process the images & videos and look for who's in them.
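
The upload path just described (store the file, then message the engine) can be sketched roughly like this. It is a toy under loud assumptions: a dict stands in for S3, a deque stands in for whatever messaging the real system uses, and handle_upload, run_engine and recognize are hypothetical names, not the real service's code.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

# In-memory stand-ins (assumptions): a dict plays the object store and a
# deque plays the message channel between the website and the engine.
object_store: Dict[str, bytes] = {}
work_queue: Deque[str] = deque()


def handle_upload(key: str, data: bytes) -> None:
    """Store the upload, then tell the engine there is new work."""
    object_store[key] = data
    work_queue.append(key)


def run_engine(recognize: Callable[[bytes], List[str]]) -> Dict[str, List[str]]:
    """Drain the queue and return the tags found for each stored file."""
    tags: Dict[str, List[str]] = {}
    while work_queue:
        key = work_queue.popleft()
        tags[key] = recognize(object_store[key])
    return tags
```

The point of the split is that the website never does the heavy lifting itself. It only records that work exists, so the engine can be offline until there is something in the queue.
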
The actual engines are kept offline until needed, and the control service spins up n mod 10 instances for images and n for videos. In other words, the number of instances running is entirely dependent upon the number of images & videos that need to be processed. The control server will spin them back down when their workload is complete.

Once indexing is complete, a reference to the file, along with the tags created, is batched up and sent to the image reprocessing engine. This service will embed the tags as metadata directly into the file. The tags and a file reference are then stored in the database for later queries by the website. The decision to embed the tags directly in the files is actually a failsafe in case the DB becomes unrecoverable or the backup is too stale. If that happens, you can just re-index the images without rerunning the workflow.

Since I built this, Amazon has released a "cloud based" workflow service targeted at exactly this sort of workflow. If he ever decides to make a 2.0 version I'll be leveraging that instead.

Back to the original point... In the year and a half that this has been operational, there has been no downtime and no "lost" data or images. The workload is medium to high but very spiky. Averaged out, he's putting about 10GB of data into the system daily. Time is money; he doesn't want to have to deal with ANY downtime.

On the other hand, not a single one of these instances has had uptime in excess of a week. It seems like something is always going wrong, but when I built it, I built it to self heal. When I went into this I was well aware of uptime & availability issues from cloud providers, especially AWS. But I try to never design any system where too many eggs are in a single basket. So I built this in a distributed workload fashion with proper monitoring, alerting & repair scripts.

The DB is non-responsive? Spin it down, spin up a new one. The webserver is down? Deploy a new one and repoint the DNS to it.
File missing from S3? Call Glacier and tell it to bring it back.

You can't just move an existing application to the "cloud" and have it do anything other than just "sorta work". If you're going to be using these systems you need to know what they are, how they work and, most importantly, what their failure modes are. Every design has to be undertaken with the question "what happens WHEN this part falls down?" in mind. When you build for this you need to take failure of each component as an inevitability. It's a foregone conclusion that something will fail at the absolute worst moment. You need to always have something watching for signs of failure and ready to replace the failing part when they appear. Then you also need to have something watching that :)

On Wed, Dec 11, 2013 at 7:03 PM, Sasha Pachev <[email protected]> wrote:

> >Not picking on you but do you honestly think someone hosting their own
> >server will have better uptime than using one of the current top tier cloud
> >providers?
>
> I do not work much with clouds, but I have had some experiences that
> makes me wonder about the stability of the current cloud solutions:
>
> * I have seen MySQL stuck due to failed I/O several times on Amazon
> cloud. Never quite like that on a dedicated machine - not so
> spectacularly where every read() syscall would just sit there
> indefinitely instead of coming back with some kind of an error.
> * Netflix outage due to cloud failure made the news recently. I do not
> recall a major news item that had to do with a regular dedicated
> server failure. In fact, it was quite exciting - does not happen
> often.
> * I ran the Big Cottonwood Canyon Half-Marathon this year. When I got
> home I went to their website to check the results and got an error
> several times. Retried several times after giving it some time to
> auto-heal or have the admin take care of it. Then after some time the
> site started loading, but was extremely slow. I saw the domain of the
> backend scripts was rhcloud.com.
> I realize that a poorly written PHP
> script combined with a poorly written MySQL query can produce some
> wonderful results, but you can only botch it so much while fetching
> only 5K records on modern hardware. I have seen horrendously
> inefficient code perform just fine even under load on a normal
> dedicated server.
>
> Now the idea of clouds is great. However, I fear that our ability to
> get excited about them exceeds our ability to implement them properly
> which is not an easy task.
>
> --
> Sasha Pachev
>
> Fast Running Blog.
> http://fastrunningblog.com
> Run. Blog. Improve. Repeat.
>
> /*
> PLUG: http://plug.org, #utah on irc.freenode.net
> Unsubscribe: http://plug.org/mailman/options/plug
> Don't fear the penguin.
> */
>

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
