Thanks, Jim. You saved our day. We'll send some Austrian cheese over :)

Jodok

On 19.07.2007, at 13:06, Jim Fulton wrote:

Over the past few months, we've struggled quite a bit with Python Package Index (PyPI) performance and stability. Thanks to the heroic efforts of Martin v. Löwis and others, performance and especially stability have improved quite a bit. Martin has demonstrated that, at least when running well, PyPI seems to answer most requests in around 7 milliseconds (around 150 requests per second) internally. That's not bad.

Unfortunately for users, actual times can be quite a bit longer. For me at work, requests take around 300 milliseconds. For Martin, they seem to take somewhat longer. 300 milliseconds isn't so bad for a request or two. However, easy_install can easily make tens or even hundreds of requests to satisfy a user request for a package. zc.buildout, when verifying that a large system with many tens of packages has the most up-to-date versions of each package, can easily make thousands of requests.

Why do setuptools and buildout make so many requests? If a package exposes more than one release, then setuptools checks the package's main PyPI page and the pages for each release. We need to be able to easily use older releases, so we can't hide old releases. Typical projects of ours have many old releases exposed. Even if setuptools were more clever in the way it searched PyPI, it would still have to make a minimum of 2 requests per package for packages with multiple versions exposed.

Another potential issue is that PyPI pages can be large. I've found it convenient to use PyPI package pages as the home page for many of my projects. I like to include package documentation in my project pages. Perhaps this is an abuse of PyPI, but it is very convenient for me and no one has complained. :) The zc.buildout pages are around 200K. That's a fair bit of data for setuptools to download and scan for download URLs.

In the course of this discussion, I've realized that it doesn't make sense for setuptools to use the same interface that humans use.
setuptools doesn't need to see all of the data that is useful to humans. Similarly, humans generally don't need to see all of the historical releases for a project.

I suggested a simple page format designed just for setuptools. An alternative would be an XML-RPC API. I prefer pages because I think that, over time, the number of requests from automated tools like easy_install and zc.buildout will increase substantially and will ultimately overwhelm dynamic servers, even ones like PyPI that are reasonably fast. I also think that a simple static collection of pages will be easier to mirror, and I think some number of geographic mirrors is likely to help some people.

I promised to prototype the format I suggested. I've created an experimental prototype setuptools-specific package index at:

    http://download.zope.org/ppix

Going to that page gives brief instructions for using it with easy_install and zc.buildout. To see an individual package page, add the package name to the URL, as in:

    http://download.zope.org/ppix/setuptools/

A few things to note about this:

- I don't expose a long package list at http://download.zope.org/ppix/. The long package list would be expensive to download and supports a use case that I consider to be of negative value, which is installing packages with case-insensitive package names. I think it is important for humans to be able to search for packages using case-insensitive search terms, but I think that, after identifying a package, precise package names should be used. I think it is especially important that precise package names be used in package requirements.

- There is a single page per package. This can greatly reduce the number of requests. Packages that store all of their distributions in PyPI and that don't have off-site home pages or download URLs can be scanned with a single request.
Note that I excluded home page and download URLs that pointed back to the package's PyPI page, as that wouldn't provide any new information to setuptools.

- Download URLs for *hidden* packages are included. Humans don't need to see old revisions, but setuptools-based tools do. If we used an index like this for setuptools, we could stop unhiding old releases when we create new releases in PyPI. This would make PyPI more useful to humans and less of a pain for developers.

- Download URLs are the same as they are in PyPI. Using this new index, distributions are still downloaded from PyPI, so the index doesn't affect PyPI download statistics.

To see the impact of this, it's interesting to look at installing zc.buildout using easy_install from PyPI and from the experimental index.

Installing using PyPI looks like this:

    (env)[EMAIL PROTECTED]:~/tmp$ time easy_install zc.buildout
    Searching for zc.buildout
    Reading http://cheeseshop.python.org/pypi/zc.buildout/
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b19
    Reading http://svn.zope.org/zc.buildout
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b22
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b23
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b20
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b21
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b26
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b27
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b24
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b25
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b28
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b17
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b16
    Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b18
    Best match: zc.buildout 1.0.0b28
    Downloading http://cheeseshop.python.org/packages/2.5/z/zc.buildout/zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d
    Processing zc.buildout-1.0.0b28-py2.5.egg
    creating /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
    Extracting zc.buildout-1.0.0b28-py2.5.egg to /home/jim/tmp/env/lib/python2.5
    Adding zc.buildout 1.0.0b28 to easy-install.pth file
    Installing buildout script to /home/jim/tmp/env/bin/
    Installed /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
    Processing dependencies for zc.buildout
    Searching for setuptools==0.6c6
    Best match: setuptools 0.6c6
    Processing setuptools-0.6c6-py2.5.egg
    Adding setuptools 0.6c6 to easy-install.pth file
    Installing easy_install script to /home/jim/tmp/env/bin/
    Installing easy_install-2.5 script to /home/jim/tmp/env/bin/
    Installed /home/jim/tmp/env/lib/python2.5/setuptools-0.6c6-py2.5.egg
    Processing dependencies for setuptools==0.6c6
    Finished processing dependencies for setuptools==0.6c6
    Finished installing setuptools==0.6c6
    Finished processing dependencies for zc.buildout
    Finished installing zc.buildout

    real    0m31.360s
    user    0m1.136s
    sys     0m0.060s

Note the large number of pages read. Here I was installing a single package with one dependency, setuptools, that was already installed.

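A minimal sketch for tallying those page reads, assuming easy_install prints one "Reading <url>" line per page it fetches, as in the transcript above. The helper name count_index_reads and the sample variable pypi_run are illustrative only and are not part of setuptools.

```python
def count_index_reads(log_text):
    """Return how many 'Reading <url>' lines appear in easy_install output."""
    return sum(
        1
        for line in log_text.splitlines()
        if line.strip().startswith("Reading ")
    )


# The "Reading" lines from the PyPI run above, plus two surrounding lines
# to show that other output is ignored.
pypi_run = """\
Searching for zc.buildout
Reading http://cheeseshop.python.org/pypi/zc.buildout/
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b19
Reading http://svn.zope.org/zc.buildout
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b22
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b23
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b20
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b21
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b26
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b27
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b24
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b25
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b28
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b17
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b16
Reading http://cheeseshop.python.org/pypi/zc.buildout/1.0.0b18
Best match: zc.buildout 1.0.0b28
"""

print(count_index_reads(pypi_run))  # prints 15
```

Running the same helper over output captured against a different index should give a like-for-like count of pages read.
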
Let's look at this again using the experimental index:

    (env)[EMAIL PROTECTED]:~/tmp$ time easy_install -i http://download.zope.org/ppix zc.buildout
    Searching for zc.buildout
    Reading http://download.zope.org/ppix/zc.buildout/
    Best match: zc.buildout 1.0.0b28
    Downloading http://cheeseshop.python.org/packages/2.5/z/zc.buildout/zc.buildout-1.0.0b28-py2.5.egg#md5=4e37e53f010ed7984555a029732f479d
    Processing zc.buildout-1.0.0b28-py2.5.egg
    creating /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
    Extracting zc.buildout-1.0.0b28-py2.5.egg to /home/jim/tmp/env/lib/python2.5
    Adding zc.buildout 1.0.0b28 to easy-install.pth file
    Installing buildout script to /home/jim/tmp/env/bin/
    Installed /home/jim/tmp/env/lib/python2.5/zc.buildout-1.0.0b28-py2.5.egg
    Processing dependencies for zc.buildout
    Searching for setuptools==0.6c6
    Best match: setuptools 0.6c6
    Processing setuptools-0.6c6-py2.5.egg
    Adding setuptools 0.6c6 to easy-install.pth file
    Installing easy_install script to /home/jim/tmp/env/bin/
    Installing easy_install-2.5 script to /home/jim/tmp/env/bin/
    Installed /home/jim/tmp/env/lib/python2.5/setuptools-0.6c6-py2.5.egg
    Processing dependencies for setuptools==0.6c6
    Finished processing dependencies for setuptools==0.6c6
    Finished installing setuptools==0.6c6
    Finished processing dependencies for zc.buildout
    Finished installing zc.buildout

    real    0m7.006s
    user    0m0.244s
    sys     0m0.040s

Note:

- We made far fewer requests with the new index.

- Most of the time in the second example was spent actually downloading the buildout distribution. Most of the time in the first example was spent reading the index.

- I used workingenv to create clean environments for each of the examples above.

WRT zc.buildout, refreshing a buildout with just ZODB installed in it takes about 45 seconds for me using PyPI and about 5 seconds using the experimental index.

Some of the speed improvement is due to the fact that the experimental index is much closer to me (on the net) than PyPI.
ATM, requests to PyPI take *me* around 500 milliseconds, while requests to the experimental index are taking between 100 and 300 milliseconds. (I'm at home and this seems to be somewhat variable.) Most of the speed improvement comes from reducing the number of requests.

I'm polling PyPI once a minute to get and apply updates. Thanks to the new XML-RPC method that Martin added, this is very efficient to do.

I encourage people to check this out and even try using it with easy_install and especially buildout. AFAIK, aside from being much faster and showing download files for hidden releases, it is completely equivalent to PyPI for setuptools use. My intention is to keep this experimental index going and up to date for the foreseeable future, and I plan to use it for all my work.

My primary goal is to prototype the new index format. If this seems useful, then I think that www.python.org should expose an index in this format to setuptools, either at a different URL or by satisfying setuptools requests from the index based on client information. I'd love to see this index populated via a baking mechanism that updates package pages when they change, rather than through polling as I'm doing.

There would be some benefit to having geographic mirrors. I suspect that having such mirrors available would improve performance further, at least for some folks. It might also be useful to have some mirrors for redundancy purposes. Note though that what I'm doing is mirroring only the index data. I'm not mirroring distributions.

Of course, I'd be happy to make my software available. (It already is via our subversion repository.)

I hope this effort spurs useful discussion and progress.

Jim

--
Jim Fulton          mailto:[EMAIL PROTECTED]    Python Powered!
CTO                 (540) 361-1714              http://www.python.org
Zope Corporation    http://www.zope.com         http://www.zope.org

_______________________________________________
Catalog-SIG mailing list
[EMAIL PROTECTED]
http://mail.python.org/mailman/listinfo/catalog-sig

--
"Although never is often better than *right* now."
-- The Zen of Python, by Tim Peters

Jodok Batlogg, Lovely Systems
Schmelzhütterstraße 26a, 6850 Dornbirn, Austria
phone: +43 5572 908060, fax: +43 5572 908060-77
_______________________________________________
Distutils-SIG maillist  -  Distutils-SIG@python.org
http://mail.python.org/mailman/listinfo/distutils-sig