As mentioned, I've been working on a proposal text for the cloud idea. Here's a first draft. Please have a look and let me know whether I've missed any important facts. Thanks.
I intend to post the proposal to the PSF board (of which I'm a member, in case you didn't know) and to have it vote on the proposal in one of the next board meetings.

"""
PSF-Proposal: 100
Title: Move PyPI static data to the cloud for better availability
Version: Draft 1
Last-Modified: 2010-06-15
Author: m...@lemburg.com (Marc-André Lemburg)
Discussions-To: catalog-sig@python.org
Status: Draft
Type: Informational
Created: 2010-06-14
Post-History:

Proposal: Move PyPI static data to the cloud for better availability
========================================================================

Motivation
----------

PyPI has in recent months seen several outages, with the index being unavailable both to users of the web GUI and to package administration tools such as easy_install from setuptools.

As more and more Python applications rely on tools such as easy_install for direct installation, or zc.buildout to manage the complete software configuration cycle, the PyPI infrastructure receives more and more attention from the Python community.

In order to maintain its credibility as a software repository, to support the many different projects relying on the PyPI infrastructure and the many users who rely on the simplified installation process enabled by PyPI, the PSF needs to take action and move the essential parts of PyPI to a more robust infrastructure that provides:

 * scalability
 * 24/7 system administration management
 * geo-localized fast and reliable access

Current Situation
-----------------

PyPI is currently run from a single server hosted in The Netherlands (ximinez.python.org). This server is run by a very small team of sys admins.

PyPI itself has in recent months been mostly maintained by one developer: Martin von Loewis.

Projects are underway to enhance PyPI in various ways, including a proposal to add external mirroring (PEP 381), but these are all far from being finalized or implemented.
Usage
-----

PyPI provides four different mechanisms for accessing the stored information:

 * a web GUI that is meant for use by humans
 * an RPC interface which is mostly used for uploading new content
 * a semi-static /simple package listing, used by setuptools
 * a static area /packages for package download files and documentation, used by both the web GUI and setuptools

The /simple package listing is a dump of all packages in PyPI using a simple HTML page with links to sub-pages for each package. These sub-pages provide links to download files and external references.

External tools like easy_install only use the /simple package listing together with the hosted package download files.

While the /simple package listing is currently created dynamically from the database in real time, this is not really needed for normal operation. A static copy created every 10-20 minutes would provide the same level of service in much the same way.

Moving static data to a CDN
---------------------------

Under the proposal the static information stored in PyPI (meta-information as well as package download files and documentation) is moved to a content delivery network (CDN).

For this purpose, the /simple package listing is replaced with a static copy that is recreated every 10-20 minutes using a cronjob on the PyPI server.

At the same intervals, another script will scan the package and documentation files under /packages for updates and upload any changes to the CDN, so that they become available in near real time.

By using a CDN the PSF will gain:

 * high availability of the static PyPI content
 * management offloaded to the CDN
 * geo-localized downloads, i.e. the files are hosted on a nearby server
 * faster downloads
 * more reliability and scalability
 * a move away from a single point of failure setup

Note that the proposal does not cover distribution of the dynamic parts of PyPI. As a result, uploads to PyPI may still fail if the PyPI server goes down.
However, these dynamic parts are currently not being used by the existing package installation tools.

Choice of CDN: Amazon Cloudfront
--------------------------------

To keep the costs low for the PSF, Amazon Cloudfront appears to be the best choice of CDN.

Cloudfront is supported by a set of Python libraries (e.g. Amazon S3 lib and boto). Upload scripts are readily available and can easily be customized:

 http://www.saltycrane.com/blog/2008/12/card-store-project-4-notes-using-amazons-cloudfront/

Other CDNs, such as Akamai, are either more expensive or require custom integration. Python-based tools are not always available for them, and for most of the proprietary CDNs it is difficult even to find out whether such tools exist.

Cloudfront: quality of service
------------------------------

Amazon Cloudfront uses S3 as the basis for the service. S3 has been around for years and has a very stable uptime:

 http://www.readwriteweb.com/archives/amazon_s3_exceeds_9999_percent_uptime.php

Cloudfront itself has been around since Nov 2008. You can check their current online status using this panel:

 http://status.aws.amazon.com/

Apart from the gained availability and outsourced management, we'd also get faster downloads in most parts of the world, due to the local caching Cloudfront applies.

This caching can be used to further increase the availability, since we can control the expiry time of those local copies.

So in summary we are replacing a single point of failure with N points of failure (with N being the number of edge caching servers they use).

How Cloudfront works
--------------------

Cloudfront uses Amazon's S3 storage system, which is based on "buckets". These can store any number of files in a directory-like structure. The only limit is 5GB per file, which is more than enough for any PyPI package file.

Cloudfront provides a domain for each registered S3 bucket via a "distribution", which is then made available through local cache servers in various locations around the world.
The management of which server to use for an incoming request is transparently handled by Amazon.

Once uploaded to the S3 bucket, the files will be distributed to the cache servers on demand and as necessary.

Each edge server maintains a cache of requested files and refetches the files after an expiry time, which can be defined when uploading the file to the bucket.

To simplify things on our side, we'll set up a CNAME DNS alias for the Cloudfront domain issued by Amazon to our bucket:

 pypi-static.python.org. IN CNAME d32z1yuk7jeryy.cloudfront.net.

For more details, please see the Cloudfront documentation:

 http://aws.amazon.com/documentation/cloudfront/

Integration
-----------

In order to keep the number of changes to existing client side tools and PyPI itself to a minimum, the installation will try to be as transparent to both the server and the client side as possible.

This requires on the server side:

 * few, if any, changes to the PyPI code base
 * simple scripts, driven by cronjobs
 * a simple distributed redirection setup to avoid having to change client side tools

On the client side:

 * no need to change the existing URL http://pypi.python.org/simple to access PyPI
 * redirects are already supported by setuptools via urllib2

Server side: upload cronjobs
----------------------------

Since the /simple index tree is currently being created dynamically, we'd need to create static copies of it at regular intervals in order to upload the content to the S3 bucket. This can easily be done using tools such as wget or curl.

Both the static copy of the /simple tree and the static files uploaded to /packages then need to be uploaded or updated in the S3 bucket by a cronjob running every 10-20 minutes.
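As a rough illustration of the change detection such a cronjob would need, the sketch below compares MD5 checksums of the local files against a manifest of what was uploaded on the previous run and returns the files that still have to be pushed to the bucket. The function names (file_md5, plan_uploads) and the manifest format (a dict mapping relative path to checksum) are assumptions made for this example and not part of PyPI. The upload step itself (e.g. via boto) is deliberately left out.

```python
import hashlib
import os


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_uploads(local_root, manifest):
    """Compare files under local_root with the previous manifest.

    manifest maps bucket keys (relative paths using "/") to the MD5
    checksum recorded at the last upload.  Returns a tuple
    (changed_keys, new_manifest), where changed_keys lists the new or
    modified files in sorted order.
    """
    changed = []
    new_manifest = {}
    for dirpath, dirnames, filenames in os.walk(local_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Hypothetical key layout: path relative to the static root.
            key = os.path.relpath(full, local_root).replace(os.sep, "/")
            checksum = file_md5(full)
            new_manifest[key] = checksum
            if manifest.get(key) != checksum:
                changed.append(key)
    return sorted(changed), new_manifest
```

A cronjob could call plan_uploads() on the directory holding the static /simple copy and the /packages files, upload the returned keys, and then store the new manifest for the next run. Files deleted locally are not handled in this sketch.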
Server side: downloads statistics
---------------------------------

The next step would then be to configure access logs:

 http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/index.html?AccessLogs.html

and add a cronjob to download them to the PyPI server.

Since the format is a bit different than the Apache log format used by the PyPI software, we'd have two options:

 1. convert the Cloudfront format to Apache format and simply append the converted logs to the local log files

 2. write a Cloudfront log file reader and add it to the apache_count_dist.py script that updates the download counts on the web GUI

Both options require no more than a few hours to implement and test.

Server side: redirection setup
------------------------------

Since PyPI wasn't designed to be put on a CDN, it mixes static file URL paths with dynamic access ones, e.g.

 dynamic:
   http://pypi.python.org/pypi
   (and a few others)

 static:
   http://pypi.python.org/simple
   http://pypi.python.org/packages

To move part of the URL path tree to a CDN, which works based on domains, we will need to provide a URL redirection setup that redirects client side tools to the new location.

As Martin von Loewis mentioned, this will require distributing the redirection setup to more than just one server as well.
Fortunately, this is not difficult to do: it requires a preconfigured lighttpd (*) setup running on N different servers which then all provide the necessary redirections (and nothing more):

 dynamic:
   http://pypi.python.org/ -> http://ximinez.python.org/pypi
   http://pypi.python.org/pypi -> http://ximinez.python.org/pypi
   (and possibly a few others)

 static:
   http://pypi.python.org/simple -> http://pypi-static.python.org/simple
   http://pypi.python.org/packages -> http://pypi-static.python.org/packages
   http://pypi.python.org/documentation -> http://pypi-static.python.org/documentation

(note: pypi-static.python.org is a CNAME alias for the Cloudfront domain issued to the S3 bucket where we upload the data)

The pypi.python.org domain would then have to be set up to map to multiple IP addresses via DNS round-robin, one entry for each redirection server, e.g.

 pypi.python.org. IN A 123.123.123.1
 pypi.python.org. IN A 123.123.123.2
 pypi.python.org. IN A 123.123.123.3
 pypi.python.org. IN A 123.123.123.4

Redirection servers could be run on all PSF server machines and, to increase availability, on PSF partner servers as well.

(*) lighttpd is a lightweight and fast HTTP server. It's easy to set up, doesn't require a lot of resources on the server machine and runs stably.

Long-term changes
-----------------

While enabling the above redirection setup, we should also start working on changing PyPI and the client tools to use two new domains which then cleanly separate the static CDN file access from the dynamic PyPI server access:

 pypi.python.org
 pypi-static.python.org

Such a transition on the client side is expected to take at least a few years.

After that, the redirection service can be shut down or used to distribute and scale the dynamic PyPI service parts.

Side-effects
------------

Restarts of the PyPI server, network outages, or hardware failures would not affect the static copies of PyPI on the CDN. setuptools, easy_install, pip, zc.buildout, etc. would continue to work.
The S3 bucket would serve as an additional backup for the files on PyPI.

Later integration with Amazon EC2 (their virtual server offering) would easily be possible for more scalability and reduced system administration load.

Costs
-----

Amazon charges for S3 and Cloudfront storage, transfer and access. The costs vary depending on location.

 http://aws.amazon.com/cloudfront/#pricing
 http://aws.amazon.com/s3/#pricing

To get an idea of the costs, we'd have to take a closer look at the PyPI web stats:

 http://pypi.python.org/webstats/usage_201005.html

In May 2010, PyPI transferred 819GB of data and had to handle 22 million requests.

Using the AWS monthly calculator this gives roughly (I used 37KB as average object size and 35% US, 35% EU, 10% HK, 10% JP as basis):

 USD 132 per month, or about USD 1,600 per year.

Refinancing the costs
---------------------

Since PyPI is being used as an essential resource by many important Python projects (Zope, Plone, Django, etc.), it's fair to ask the respective foundations and the general Python community for donations to help refinance the administration costs.

A prominent donation button should go on the PyPI page, with a text explaining how PyPI is being hosted and why donations are necessary.

We may also be able to directly ask for donations from the above foundations. Details of this are currently being evaluated by the PSF board (there are some issues related to our non-profit status that make this more complicated than it appears at first).

Effort
------

Given that most of the tools are readily available, setting up the servers shouldn't take more than 2-3 developer days, including testing, for developers who've worked with Amazon S3 and Cloudfront before.

It is expected that we'll find volunteers to implement the necessary changes.
"""

-- 
Marc-Andre Lemburg
eGenix.com