Hi Philip,

thanks for your helpful review; almost all of it makes sense to me. Some more inlined comments below. Up front, I am open to you co-authoring the PEP if you like and share the goal of finding a minimum viable approach to speed up and simplify the interactions for installers.
On Sun, Mar 10, 2013 at 15:41 -0400, PJ Eby wrote:
> On Sun, Mar 10, 2013 at 11:07 AM, holger krekel <hol...@merlinux.eu> wrote:
> > Philip, Marc-Andre, Richard (Jones), Nick and catalog-sig/distutils-sig:
> > scrutiny and feedback welcome.
>
> Hi Holger. I'm having some difficulty interpreting your proposal
> because it is leaving out some things, and in other places
> contradicting what I know of how the tools work. It is also a bit at
> odds with itself in some places.

Certainly, it was a quick draft to get the process going and gather useful feedback, which has worked so far :)

> For instance, at the beginning, the PEP states its proposed solution
> is to host all release files on PyPI, but then the problem section
> describes the problems that arise from crawling external pages:
> problems that can be solved without actually hosting the files on
> PyPI.
>
> To me, it needs a clearer explanation of why the actual hosting part
> also needs to be on PyPI, not just the links. In the threads to date,
> people have argued about uptime, security, etc., and these points are
> not covered by the PEP or even really touched on for the most part.

Makes sense to clarify this more.

> (Actually, thinking about that makes me wonder.... Donald: did your
> analysis collect any stats on *where* those externally hosted files
> were hosted? My intuition says that the bulk of the files (by *file
> count*) will come from a handful of highly-available domains, i.e.
> sourceforge, github, that sort of thing, with actual self-hosting
> being relatively rare *for the files themselves*, vs. a much wider
> range of domains for the homepage/download URLs (especially because
> those change from one release to the next). If that's true, then most
> complaints about availability are being caused by crawling multiple
> not-highly-available HTML pages, *not* by the downloading of the
> actual files.
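(As an aside: the kind of per-domain breakdown asked about above could be sketched in a few lines. This is purely hypothetical; the sample URLs are made up, and a real analysis would feed in the external file links collected from crawling the /simple index.)

```python
from collections import Counter
from urllib.parse import urlparse

# Stand-in sample of externally hosted release file URLs; a real run
# would use the URLs from Donald's crawl data instead.
external_file_urls = [
    "http://sourceforge.net/download/someproject-1.2.tgz",
    "https://github.com/example/example/archive/0.3.tar.gz",
    "http://example.org/releases/tool-0.1.zip",
]

# Count files per hosting domain, to check whether a handful of
# highly-available hosts (sourceforge, github, ...) dominate.
domain_counts = Counter(urlparse(url).netloc for url in external_file_urls)

for domain, count in domain_counts.most_common():
    print(domain, count)
```

If the counts really do cluster on a few reliable hosts, that would support the intuition that the slowness comes from crawling many HTML pages rather than from downloading the files themselves.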
> If my intuition about the distribution is wrong, OTOH,
> it would provide a stronger argument for moving the files themselves
> to PyPI as well.)
>
> Digression aside, this is one of the things that needs to be clearer so
> that there's a better explanation for package authors as to why
> they're being asked to change. And although the base argument is good
> ("specifying the homepage will slow down the installation process"),
> it could be amplified further with an example of some project that has
> had multiple homepages over its lifetime, listing all the URLs that
> currently must be crawled before an installer can be sure it has found
> all available versions, platforms, and formats of that project.

Right, an example makes sense.

> Okay, on to the Solution section. Again, your stated problem is to
> fix crawling, but the solution is all about file hosting. Regardless
> of which of these three "hosting modes" is selected, it remains an
> option for the developer to host files elsewhere, and provide the
> links in their description... unless of course you intended to rule
> that out and forgot to mention it. (Or, I suppose, if you did *not*
> intend to rule it out and intentionally omitted mention of that so the
> rabid anti-externalists would think you were on their side and not
> create further controversy... in which case I've now spoiled things.
> Darn. ;-) )

To be honest, while drafting I forgot that the long_description can contain package links as well.

> Some technical details are also either incorrect or confusing. For
> example, you state that "The original homepage/download links are
> added as links without a ``rel`` attribute if they have the ``#egg``
> format". But if they are added without a rel attribute, it doesn't
> *matter* whether they have an #egg marker or not. It is quite
> possible for a PyPI package to have a download_url of, say,
> "http://sourceforge.net/download/someproject-1.2.tgz".

Right.
I just wanted to clarify that the distutils metadata "download_url" can contain an #egg format link and that this link should still be served (without a rel).

> Thus, I would suggest simply stating that changing hosting mode does
> not actually remove any links from the /simple index, it just removes
> the rel="" attributes from the "Home page" and "Download" links, thus
> preventing them from being crawled in search of additional file links.

That's certainly a better description of what effectively happens, and it avoids the special mention of #egg.

> With that out of the way, that brings me to the larger scope issue
> with the modes as presented. Notice now that with this clarification,
> there is no real difference in *state* between the "pypi-cache" and
> "pypi-only" modes. There is only a *functional* difference... and
> that function is underspecified in the PEP.

Agreed.

> What I mean is, in both pypi-cache and pypi-only, the *state* of
> things is that rel="" attributes are gone, and there are links to
> files on PyPI. The only difference is in *how* the files get there.

Yes.

> And for the pypi-cache mode, this function is *really*
> under-specified. Arguably, this is the meat of the proposal, but it
> is entirely missing. There is nothing here about the frequency of
> crawling, the methods used to select or validate files, whether there
> is any expiration... it is all just magically assumed to happen
> somehow.

I'd like to avoid cache-invalidation issues by only performing cache updates upon three user actions:

- when a release is registered for a package which is in "pypi-cache" hosting mode
- when a maintainer chooses to set "pypi-cache"
- when a maintainer explicitly triggers a "cache" update

All actions allow pypi.python.org to provide feedback / error codes, so there is nothing hidden going on at regular intervals.
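To make the intent concrete, the three trigger points could look roughly like this. This is a hypothetical sketch only, with made-up function and field names, not actual PyPI code:

```python
# Sketch of trigger-driven cache updates: the cache is refreshed only
# on explicit user actions, never on a timer, so every update happens
# in a context where feedback/errors can be shown to the user.
# All names (update_cache, hosting_mode, ...) are invented for this sketch.

def update_cache(package):
    """Pull the package's external release files into PyPI's cache.

    A real implementation would download and verify the files;
    here we just record them and return a status message.
    """
    package.setdefault("cached_files", []).extend(
        package.get("external_files", [])
    )
    return "cache updated for %s" % package["name"]

def on_release_registered(package):
    # Trigger 1: a new release is registered for a "pypi-cache" package.
    if package.get("hosting_mode") == "pypi-cache":
        return update_cache(package)
    return "no caching (mode=%s)" % package.get("hosting_mode")

def on_set_hosting_mode(package, mode):
    # Trigger 2: the maintainer switches the package to "pypi-cache".
    package["hosting_mode"] = mode
    if mode == "pypi-cache":
        return update_cache(package)
    return "mode set to %s" % mode

def on_manual_cache_update(package):
    # Trigger 3: the maintainer explicitly requests a cache update.
    return update_cache(package)
```

The point of restricting updates to these three entry points is that there is no background crawl whose failures could go unnoticed; every cache update is tied to a user action that can surface an error.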
> My suggestion would be to do two things:
>
> First, make the state a boolean: crawl external links, with the
> current state yes and the future state no, with "no" simply meaning
> that the rel="" attribute is removed from the links that currently
> have it.
>
> Second, propose to offer tools in the PyPI interface (and command
> line) to assist authors in making the transition, rather than
> proposing a completely unspecified caching mechanism. Better to have
> some vaguely specified tools than a completely unspecified caching
> mechanism, and better still to spell out very precisely what those
> tools do.

This structure makes sense to me, except that I see the need to start off with "pypi-ext", i.e. a third state which encodes the current behaviour. The thing is that pypi.python.org doesn't have an extensive test suite, so we will need to rely on a few early adopters using the tools/state changes before starting phase 2 (mass mailings etc.). Also, in case of problems we can always switch packages back to the safe "pypi-ext" mode. IOW, the motivation for this third state is the practicalities of the actual implementation process.

> Okay, on to the "Phases of transition". This section gets a lot
> simpler if there are only two stages. Specifically, we let everyone
> know the change is going to happen, and how long they have, and give 'em
> links to migration tools. Done. ;-)
>
> (Okay, so analysis still makes sense: the people who don't have any
> externally hosted files can get a different message, i.e., "Hey, we
> notice that installing your package is slow because you have these
> links that don't go anywhere. Click here if you'd like PyPI to stop
> sending people on wild goose chases". The people who have externally
> hosted files will need a more involved message.)
>
> Whew. Okay, that ends my critique of the PEP as it sits. Now for an
> outside-the-box suggestion.
> If you'd like to be able to transition people away from spidered links
> in the fewest possible steps, with the least user action, no legal
> issues, and in a completely automated way, note that this can be done
> with a one-time spidering of the existing links to find the download
> links, then adding those links directly to the /simple index, and
> switching off the rel="" attributes. This can be done without
> explicit user consent, though they can be given the chance to do it
> manually, sooner.

Right, my mail preceding the "pre-PEP" one contained a "linkext" state which spidered the links and offered them directly. It's certainly possible and indeed would likely not raise legal issues. It might have cache-invalidation issues, though, and it probably makes the PyPI-side implementation more complex. It also goes a bit against the current intention of the PEP to have pypi.python.org control all hosting of release files.

> To implement this you'd need two project-level (*not* release-level)
> fields: one to indicate whether the project is using rel="" or not,
> and one to contain the list of external download links, which would be
> user-editable.
>
> This overall approach I'm proposing can be extended to also support
> mirroring, since it provides an explicit place to list what it is
> you're mirroring. (At any rate, it's more explicitly specified than
> any such place in the current PEP.)
> That field can also be fairly easily populated for any given project
> in just a few lines of code:
>
>     from pkg_resources import Requirement
>     from setuptools.package_index import PackageIndex
>
>     pr = Requirement.parse('Projectname')
>     pi = PackageIndex(search_path=[], python=None, platform=None)
>     pi.find_packages(pr)
>     all_urls = [dist.location for dist in pi[pr.key]]
>     external_urls = [url for url in all_urls
>                      if '//pypi.python.org' not in url]
>
> (Although if you want more information, like what kind of link each
> one is, the dist objects actually know a bit more than just the URL.)
>
> Anyway, I hope you found at least some of all this helpful. ;-)

Certainly! Will try to do an update incorporating your suggestions in the next few days.

best,
holger

_______________________________________________
Catalog-SIG mailing list
Catalog-SIG@python.org
http://mail.python.org/mailman/listinfo/catalog-sig