On Sun, Mar 10, 2013 at 11:07 AM, holger krekel <hol...@merlinux.eu> wrote:
> Philip, Marc-Andre, Richard (Jones), Nick and catalog-sig/distutils-sig:
> scrutiny and feedback welcome.
Hi Holger. I'm having some difficulty interpreting your proposal, because it leaves out some things and in other places contradicts what I know of how the tools work. It is also a bit at odds with itself. For instance, at the beginning, the PEP states that its proposed solution is to host all release files on PyPI, but then the problem section describes the problems that arise from crawling external pages: problems that can be solved without actually hosting the files on PyPI. To me, it needs a clearer explanation of why the actual hosting part also needs to be on PyPI, not just the links. In the threads to date, people have argued about uptime, security, etc., and these points are not covered by the PEP or even really touched on for the most part.

(Actually, thinking about that makes me wonder... Donald: did your analysis collect any stats on *where* those externally hosted files were hosted? My intuition says that the bulk of the files (by *file count*) will come from a handful of highly-available domains, i.e. sourceforge, github, that sort of thing, with actual self-hosting being relatively rare *for the files themselves*, vs. a much wider range of domains for the homepage/download URLs (especially because those change from one release to the next). If that's true, then most complaints about availability are being caused by crawling multiple not-highly-available HTML pages, *not* by the downloading of the actual files. If my intuition about the distribution is wrong, OTOH, it would provide a stronger argument for moving the files themselves to PyPI as well.)

Digression aside, this is one of the things that needs to be clearer, so that there's a better explanation for package authors as to why they're being asked to change. And although the base argument is good ("specifying the homepage will slow down the installation process"), it could be amplified further with an example of some project that has had multiple homepages over its lifetime, listing all the URLs that currently must be crawled before an installer can be sure it has found all available versions, platforms, and formats of that project.

Okay, on to the Solution section. Again, your stated problem is to fix crawling, but the solution is all about file hosting. Regardless of which of these three "hosting modes" is selected, it remains an option for the developer to host files elsewhere and provide the links in their description... unless of course you intended to rule that out and forgot to mention it. (Or, I suppose, you did *not* intend to rule it out and intentionally omitted mention of it so the rabid anti-externalists would think you were on their side and not create further controversy... in which case I've now spoiled things. Darn. ;-) )

Some technical details are also either incorrect or confusing. For example, you state that "The original homepage/download links are added as links without a ``rel`` attribute if they have the ``#egg`` format". But if they are added without a rel attribute, it doesn't *matter* whether they have an #egg marker or not. It is quite possible for a PyPI package to have a download_url of, say, "http://sourceforge.net/download/someproject-1.2.tgz". Thus, I would suggest simply stating that changing hosting mode does not actually remove any links from the /simple index; it just removes the rel="" attributes from the "Home page" and "Download" links, thus preventing them from being crawled in search of additional file links.
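To make that distinction concrete, here is a rough sketch, in the spirit of (but not copied from) the actual scraping code in installers, of how a tool decides what to follow on a /simple/Project/ page. The class and attribute names are mine, purely for illustration:

    try:
        from html.parser import HTMLParser   # Python 3
    except ImportError:
        from HTMLParser import HTMLParser    # Python 2

    class SimpleIndexScanner(HTMLParser):
        """Split a /simple page into direct file links vs. pages to spider."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.file_links = []    # used as-is: candidate distribution files
            self.crawl_links = []   # fetched and scraped for more file links

        def handle_starttag(self, tag, attrs):
            if tag != 'a':
                return
            attrs = dict(attrs)
            href = attrs.get('href')
            if not href:
                return
            if attrs.get('rel') in ('homepage', 'download'):
                self.crawl_links.append(href)
            else:
                self.file_links.append(href)

    scanner = SimpleIndexScanner()
    scanner.feed(
        '<a href="Projectname-1.2.tar.gz">Projectname-1.2.tar.gz</a>'
        '<a rel="homepage" href="http://example.com/">home page</a>'
    )
    # scanner.file_links  -> ['Projectname-1.2.tar.gz']
    # scanner.crawl_links -> ['http://example.com/']

The point being: switching off the rel attributes doesn't remove anything from file_links; it just empties crawl_links, which is exactly the "no more spidering" state I'm describing.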
With that out of the way, that brings me to the larger-scope issue with the modes as presented. Notice that with this clarification, there is no real difference in *state* between the "pypi-cache" and "pypi-only" modes. There is only a *functional* difference... and that function is underspecified in the PEP. What I mean is: in both pypi-cache and pypi-only, the *state* of things is that the rel="" attributes are gone and there are links to files on PyPI. The only difference is in *how* the files get there.

And for the pypi-cache mode, this function is *really* under-specified. Arguably this is the meat of the proposal, but it is entirely missing. There is nothing here about the frequency of crawling, the methods used to select or validate files, whether there is any expiration... it is all just magically assumed to happen somehow.

My suggestion would be to do two things. First, make the state a boolean: "crawl external links", with the current state yes and the future state no, where "no" simply means that the rel="" attribute is removed from the links that currently have it. Second, propose to offer tools in the PyPI interface (and command line) to assist authors in making the transition, rather than proposing a completely unspecified caching mechanism. Better to have some vaguely specified tools than a completely unspecified caching mechanism, and better still to spell out very precisely what those tools do.

Okay, on to the "Phases of transition". This section gets a lot simpler if there are only two stages. Specifically: we let everyone know the change is going to happen and how long they have, and give 'em links to migration tools. Done. ;-)

(Okay, so analysis still makes sense: the people who don't have any externally hosted files can get a different message, i.e., "Hey, we notice that installing your package is slow because you have these links that don't go anywhere. Click here if you'd like PyPI to stop sending people on wild goose chases." The people who *do* have externally hosted files will need a more involved message.)

Whew. Okay, that ends my critique of the PEP as it sits. Now for an outside-the-box suggestion.

If you'd like to be able to transition people away from spidered links in the fewest possible steps, with the least user action, no legal issues, and in a completely automated way, note that this can be done with a one-time spidering of the existing links to find the download links, then adding those links directly to the /simple index and switching off the rel="" attributes. This can be done without explicit user consent, though users can be given the chance to do it manually, sooner.

To implement this, you'd need two project-level (*not* release-level) fields: one to indicate whether the project is using rel="" or not, and one to contain the list of external download links, which would be user-editable. This overall approach I'm proposing can be extended to also support mirroring, since it provides an explicit place to list what it is you're mirroring. (At any rate, it's more explicitly specified than any such place in the current PEP.)
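For concreteness, here's one way those two fields might look. The names ('crawl_external', 'external_download_urls') and the dict representation are purely illustrative, not a proposal for PyPI's actual schema:

    # Hypothetical per-project (NOT per-release) settings; names made up.
    project_settings = {
        # The boolean replacing the three hosting modes: does /simple still
        # emit rel="homepage"/rel="download" on this project's links?
        'crawl_external': False,

        # User-editable list of externally hosted files, populated once by
        # spidering the old links; these hrefs appear directly in /simple.
        'external_download_urls': [
            'http://example.com/downloads/Projectname-1.2.tar.gz',
        ],
    }

And note that the second field is exactly what a mirroring tool would need, too: an explicit, machine-readable list of what lives off-PyPI.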
That field can also be fairly easily populated for any given project in just a few lines of code:

    from pkg_resources import Requirement
    from setuptools.package_index import PackageIndex

    pr = Requirement.parse('Projectname')

    # Scrape the /simple index (and any spidered pages) for this project
    pi = PackageIndex(search_path=[], python=None, platform=None)
    pi.find_packages(pr)

    # Each distribution found knows the URL it was discovered at
    all_urls = [dist.location for dist in pi[pr.key]]
    external_urls = [url for url in all_urls
                     if '//pypi.python.org' not in url]

(Although if you want more information, like what kind of link each one is, the dist objects actually know a bit more than just the URL.)

Anyway, I hope you found at least some of all this helpful. ;-)