Steven D'Aprano wrote:
> On Tue, 15 Jun 2010 09:49:03 pm M.-A. Lemburg wrote:
>> As mentioned, I've been working on a proposal text for the cloud
>> idea. Here's a first draft. Please have a look and let me know
>> whether I've missed any important facts. Thanks.
>
> I think the most important missed fact is, just how unreliable is PyPI
> currently? Does anyone know?
>
> I know there's a number of people complaining that it's down "all the
> time", or even occasionally, but I think that we need to know the
> magnitude of the problem that needs solving. What's the average length
> of time between outages? What's the average length of the outage? Just
> saying that there's been several outages in recent months is awfully
> hand-wavy.
I'm sorry, but I can't provide any numbers, since there doesn't appear
to be any monitoring in place to pull those numbers from.

What I can say is that, from reading the various mailing lists, PyPI is
down often enough for people to start discussions about it, and that's
the point I want to address:

"""
In order to maintain its credibility as a software repository, to
support the many different projects relying on the PyPI infrastructure
and the many users who rely on the simplified installation process
enabled by PyPI, the PSF needs to take action and move the essential
parts of PyPI to a more robust infrastructure that provides:

 * scalability
 * 24/7 system administration management
 * geo-localized, fast and reliable access
"""

Setting up a Zenoss or Nagios monitoring system to keep an eye on the
PyPI server (and our other servers) would be a separate project.

> [...]
>> Amazon Cloudfront uses S3 as basis for the service, S3 has been
>> around for years and has a very stable uptime:
>>
>> http://www.readwriteweb.com/archives/amazon_s3_exceeds_9999_percent_uptime.php
>
> Is there anyone here who has personal experience with Cloudfront and is
> willing to vouch for it? Or argue against it? We can only go so far
> based on Amazon's marketing material.

I don't have personal experience with Cloudfront, but I have advised
companies to use Amazon EC2 and S3 as a disaster recovery and backup
solution. So far, none of them has ever complained.

While doing research for the proposal, I read a lot of posts by people
using Amazon S3 and Cloudfront. The overall feedback is very positive.

If things still don't work out for us, we can always go back to the
single-server setup. The proposal doesn't bind us to Cloudfront or the
CDN setup in any way.

> One thing that does worry me:
>
>> So in summary we are replacing a single point of failure with N
>> points of failure (with N being the number of edge caching servers
>> they use). 
>
> I don't think this means what you seem to think it means. If you replace
> a single point of failure with N points of failure, your overall
> reliability goes down, not up, since there are now more things to go
> wrong. Assuming that they're independent points of failure, that means
> your total number of failures will increase by a factor of N.
>
> For example, if a single edge server in (say) Australia goes down,
> Amazon might not count it as an outage for the purpose of calculating
> their 99.99% reliability since the system as a whole is still up, but
> conceivably Australian users might see an outage (or at least a
> slow-down). With N servers, I'd expect N times the number of individual
> outages, with Amazon presumably only counting it as "system down" if
> all N servers go down at the same time.

It's poor wording, I agree. Thanks for pointing this out. The math is
correct, though, I believe...

Let's say every server has the same probability of being unavailable:

    P("Server down") = q        (with q in [0,1])

Let's further assume that all servers fail independently of each other.
The probability of none of the N servers being available then is

    P("System down") = q^N <= q

Cloudfront uses a DNS round-robin system with a TTL of 60 seconds and
returns more than just one cache server per edge node. In Germany, for
example, I get 8 cache servers:

> dig d1ylr6sba64qi3.cloudfront.net

;; ANSWER SECTION:
d1ylr6sba64qi3.cloudfront.net.      57 IN CNAME d1ylr6sba64qi3.ams1.cloudfront.net.
d1ylr6sba64qi3.ams1.cloudfront.net. 57 IN A     216.137.59.184
d1ylr6sba64qi3.ams1.cloudfront.net. 57 IN A     216.137.59.250
d1ylr6sba64qi3.ams1.cloudfront.net. 57 IN A     216.137.59.84
d1ylr6sba64qi3.ams1.cloudfront.net. 57 IN A     216.137.59.106
d1ylr6sba64qi3.ams1.cloudfront.net. 57 IN A     216.137.59.15
d1ylr6sba64qi3.ams1.cloudfront.net. 57 IN A     216.137.59.102
d1ylr6sba64qi3.ams1.cloudfront.net. 57 IN A     216.137.59.40
d1ylr6sba64qi3.ams1.cloudfront.net. 57 IN A     216.137.59.118

;; AUTHORITY SECTION:
ams1.cloudfront.net. 141251 IN NS ns-ams1-01.cloudfront.net.
ams1.cloudfront.net. 141251 IN NS ns-ams1-02.cloudfront.net.

The probability of all 8 servers being down is

    P("Edge node down") = q^8 <= q

Assuming that Amazon's system monitoring is fast enough to detect the
edge-node-down state, it will likely switch me over to a different edge
node within those 60 seconds, where I'll see another 8 or so servers:

    P("2 edge nodes unavailable") = q^8 * q^8 = q^16

and so on.

Now compare all this to the probability of the single PyPI server
being down:

    P("PyPI server down") = q >> q^N = P("Cloudfront down")

In other words, the probability of PyPI on the CDN being unreachable
for more than, say, 5 minutes (assuming the switchover to all edge
nodes takes at most 5 minutes) is q^N.

In numbers: let's assume q=0.01, i.e. 99% uptime per server, and N=32
(the true number is likely higher):

    P("PyPI server down") = 0.01
                         >> P("Cloudfront down") = 0.01^32 = 1e-64

Of course, you'd have to add in the probability of the Amazon
infrastructure or network connectivity being down, as well as human
error, inherent system failures and DDoS attacks, so the actual
numbers are somewhat higher.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jun 15 2010)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________
2010-07-19: EuroPython 2010, Birmingham, UK                33 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
_______________________________________________
Catalog-SIG mailing list
Catalog-SIG@python.org
http://mail.python.org/mailman/listinfo/catalog-sig
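P.S.: The back-of-the-envelope figures in the mail above can be
sanity-checked with a short Python sketch. The values of q (per-server
downtime probability) and N (total number of cache servers) are the
assumptions from the discussion, not measured data:

```python
# Check of the downtime probabilities discussed in the mail, under the
# stated independence assumption (servers fail independently).

q = 0.01                 # assumed per-server downtime, i.e. 99% uptime
servers_per_edge = 8     # cache servers per edge node (see the dig output)
N = 32                   # assumed total number of cache servers

p_single_server = q                  # the current single-host PyPI setup
p_edge_down = q ** servers_per_edge  # all 8 servers of one edge node down: ~1e-16
p_two_edges_down = p_edge_down ** 2  # q^16, as in the mail
p_cloudfront_down = q ** N           # all N servers down at once: ~1e-64

print("single server down:", p_single_server)
print("one edge node down:", p_edge_down)
print("whole CDN down:    ", p_cloudfront_down)
```

As the mail notes, these numbers only cover independent per-server
failures; correlated failures (Amazon infrastructure outages, network
problems, human error, DDoS attacks) push the real probability up.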