Re: [Catalog-sig] Perhaps PyPI will do
Hi David,

On Thu, Apr 07, 2005 at 09:32 -0700, David Ascher wrote:
> I find the discussion depressing in many ways.

Did i miss some of the discussion? At least on catalog-sig and in the blogs it was going quite ok in my opinion. But maybe we had different expectations :-)

holger
___
Catalog-sig mailing list
Catalog-sig@python.org
http://mail.python.org/mailman/listinfo/catalog-sig
[Catalog-sig] current repo of pypi
Hello,

The http://wiki.python.org/moin/CheeseShopDev page mentioned that the repo is undergoing migration. Is there some (even intermediate) url which i could pull today?

thanks,
holger
[Catalog-sig] disabling the serving of links from description_html?
Hi Richard, hi all,

While reading the pypi main and other sources i wondered how we could switch off serving links from description_html, at least on a per-project basis. It's really annoying that adding some links to a long_description will slow down installation of your package around the world, even if you remove the links in the next release.

How could we arrange for a maintainer to communicate to the pypi server that a particular project should not ever serve links from description_html (and maybe not even from the homepage, while we are at it)? Preferably it should be something that can be done from existing setup.py files, like adding a special trove classifier or keyword. But a little custom tool or a web page form would be ok as well.

If maintainers could easily switch off these extra links, then this means less stress for the pypi server and a considerable global speedup of installing python packages, as often most of the pip/easy_install time is spent checking out these URLs.

best,
holger
Re: [Catalog-sig] disabling the serving of links from description_html?
On Tue, Dec 18, 2012 at 5:46 PM, M.-A. Lemburg <m...@egenix.com> wrote:
> On 18.12.2012 15:54, Holger Krekel wrote:
>> While reading the pypi main and other sources i wondered how we could
>> switch off serving links from description_html, at least on a
>> per-project basis. [...]
>
> Are you sure about this? AFAIK, setuptools/distribute only looks at
> links with rel=homepage or rel=download attributes, not all links on
> the PyPI project page. The links from the description don't receive
> such attributes. See e.g. http://pypi.python.org/simple/pytest/

You are right, Marc. Only the download and home page links (from all versions ever published) are considered by pip/easy_install. I just examined it more closely via urlsnarf. There were so many links in some projects, and mixed with the other links, so i didn't see it clearly before (although i did notice the rel classification).

So to avoid the overhead one could retroactively remove all download links, and maybe also all homepage links except the one for the latest version or so. But that can be done without changes to pypi itself, i guess.
best & thanks for the clarification,
holger
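The rel-based filtering Marc-Andre describes can be illustrated with a small sketch. The HTML and project names below are made up for illustration, and real installers use their own parsers; this only shows the distinction between rel-tagged links (which installers crawl) and plain description links (which they ignore):

```python
from html.parser import HTMLParser

class RelLinkParser(HTMLParser):
    """Collect anchor hrefs whose rel attribute is "download" or "homepage",
    roughly what easy_install/distribute consider on /simple/<project>/
    pages; plain description links are recorded separately."""
    def __init__(self):
        super().__init__()
        self.scraped = []   # links an installer would go on to crawl
        self.ignored = []   # ordinary links, e.g. from the long_description
    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        if "href" not in d:
            return
        if d.get("rel") in ("download", "homepage"):
            self.scraped.append(d["href"])
        else:
            self.ignored.append(d["href"])

page = (
    '<a rel="download" href="http://example.com/pytest-2.3.4.tar.gz">2.3.4</a> '
    '<a rel="homepage" href="http://pytest.org/">home</a> '
    '<a href="http://example.com/docs">a link from the description</a>'
)
parser = RelLinkParser()
parser.feed(page)
print(parser.scraped)  # ['http://example.com/pytest-2.3.4.tar.gz', 'http://pytest.org/']
print(parser.ignored)  # ['http://example.com/docs']
```

This is why removing old rel=download/rel=homepage entries (rather than description links) is what actually reduces installer crawling.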
[Catalog-sig] fresh pep381run pypi-mirroring fails since 1 week
Hi all,

During the last 7 days i tried running pep381run with a fresh directory on two different hosts. They both failed while trying to copy azb_nester-1.2.0.tar.gz; see here for the traceback: http://bpaste.net/show/SoMoyjdJEIGvm99dH6gG/

It seems that azb_nester does not have any files anymore on pypi.python.org; they probably got deleted. Is that a bug in the pep381run software, or in the pep381 mirroring protocol, or ...?

best,
holger
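Whatever the root cause, a mirroring client could defend against upstream deletions instead of aborting the whole run. A minimal sketch, where the function name and data shape are invented for illustration (pep381run's internals differ):

```python
def plan_mirror_actions(file_lists):
    """Hypothetical guard for a pep381-style mirroring run: given a
    {package_name: [file_urls]} mapping obtained from the index server,
    separate packages that still have files from those whose files were
    deleted upstream (like azb_nester here), rather than crashing on them."""
    fetch, skip = {}, []
    for name, urls in sorted(file_lists.items()):
        if urls:
            fetch[name] = urls
        else:
            skip.append(name)  # deleted upstream: log and continue, don't abort
    return fetch, skip

fetch, skip = plan_mirror_actions({
    "azb_nester": [],   # files removed from pypi.python.org
    "pytest": ["http://pypi.python.org/packages/source/p/pytest/pytest-2.3.4.tar.gz"],
})
print(skip)   # ['azb_nester']
```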
Re: [Catalog-sig] test pypi server?
Hey Chris,

according to http://pypi.python.org there should be a test pypi server at http://testpypi.python.org/pypi but at the moment it gives 502 Bad Gateway.

cheers,
holger

On Sat, Jan 26, 2013 at 10:33 AM, Chris Withers <ch...@simplistix.co.uk> wrote:
> Hi All,
> I remember mention of a test PyPI server that had been set up.
> Where can I find it? I'm doing some automated release testing...
> Chris
> --
> Simplistix - Content Management, Batch Processing
> Python Consulting - http://www.simplistix.co.uk
Re: [Catalog-sig] Use user-specific site-packages by default?
On Tue, Feb 5, 2013 at 1:51 PM, Donald Stufft <donald.stu...@gmail.com> wrote:
> On Tuesday, February 5, 2013 at 5:16 AM, Lennart Regebro wrote:
>> 1. Packages should only be installed from the given package indexes.
>> No scraping of websites as at least easy_install/buildout does, no
>> downloading from external download links. A deprecation period for
>> this of a couple of months, to give package authors the chance to
>> upload their packages, is probably necessary.
>
> PyPI will need to change for this to happen realistically if I recall.
> There is a hard limit on how large of a distribution can be uploaded
> to PyPI and there are, if I recall, valid distributions which are
> larger than that. Personally I want the installers to only install
> from PyPI, so my suggestion, if this is something that (the proverbial)
> we want to do, is that PyPI should gain some notion of a soft limit for
> distribution uploads (to prevent against DoS) with the ability to
> increase that size limit for specific projects, who can file a ticket
> w/ PyPI to have their limit increased.

Dropping the crawling of external pages needs _much_ more than just a few months of deprecation warnings, rather years. There are many packages out there, and it would break people's installations. As a random example, look at http://pypi.python.org/simple/lockfile/ - it has its last release in 2010 and 74K downloads from the 0.9 download url (going to code.google.com).

I certainly agree, though, that the current client-side crawling is a nuisance and makes for unreliability of installation procedures. I think we should move the crawling to the server side and cache packages. I am currently working on a prototype which does this (and a few other niceties). It allows keeping all installers and packages working nicely, serving all packages from one central place (cached on demand currently, but that is a policy issue).
best,
holger
Re: [Catalog-sig] Use user-specific site-packages by default?
On Tue, Feb 5, 2013 at 2:05 PM, Jesse Noller <jnol...@gmail.com> wrote:
> On Feb 5, 2013, at 8:02 AM, Holger Krekel <holger.kre...@gmail.com> wrote:
>> [...]
>> I certainly agree, though, that the current client-side crawling is a
>> nuisance and makes for unreliability of installation procedures. I
>> think we should move the crawling to the server side and cache
>> packages. I am currently working on a prototype which does this (and a
>> few other niceties). It allows keeping all installers and packages
>> working nicely, serving all packages from one central place (cached on
>> demand currently, but that is a policy issue).
>
> Derived from the current pypi code base?

No. Using it as a reference rather, and rewritten with a TDD approach, can't help it :)

holger
Re: [Catalog-sig] Use user-specific site-packages by default?
On Tue, Feb 5, 2013 at 2:13 PM, Lennart Regebro <rege...@gmail.com> wrote:
> On Tue, Feb 5, 2013 at 2:02 PM, Holger Krekel <holger.kre...@gmail.com> wrote:
>> Dropping the crawling of external pages needs _much_ more than just a
>> few months of deprecation warnings, rather years. There are many
>> packages out there, and it would break people's installations.
>
> No it won't. Nothing gets uninstalled. What stops working is installing
> those packages with pip/easy_install. And that will start again as soon
> as the maintainer uploads the last version to PyPI, which she/he is
> likely to do quite quickly after people start complaining.

I wouldn't assume that maintainers are easily reachable. I've contacted at least three people of different (1K downloads) packages who never responded.

And of course, i didn't mean to imply that already installed packages would suddenly break. Rather that installation instructions like "use pip install X" will just fail, with some dependency Y not getting installed. Or getting installed in some random lower version which might contain evil bugs (including security bugs). For example, the referenced lockfile project has a 0.2 release on pypi, but is currently at 0.9.

>> I certainly agree, though, that the current client-side crawling is a
>> nuisance and makes for unreliability of installation procedures. I
>> think we should move the crawling to the server side and cache
>> packages.
>
> That will mean that a man-in-the-middle attack might poison PyPI's
> cache. I don't think that's a feasible path forward.

Like i said (you snipped that part of the mail), it's a matter of policy. Externally available packages could be downloaded at once, and not on demand. Such a download and checksumming could be repeated over a period of time and from different machines.
Of course a remotely stored package could already be compromised, but such a possibility always exists (even if an author signs a package with PGP, his machine might be infiltrated, or the Jenkins build systems performing automated releases, etc.).

> Packages do not need to be cached, as they are not supposed to change.
> If you change the package you should really release a new version
> (unless you made a mistake and discovered it before anyone actually
> downloaded it). So what you are proposing is really that PyPI downloads
> the package from an untrusted source, if the maintainer doesn't upload
> it. I prefer that we demand that the maintainer upload it.

I actually think it might make sense to forbid referencing external files for _future_ pypi uploads (except #egg= references, probably). The maintainer trying to do that then gets a clear error and instructions on how to proceed. She is just trying to get something out, so we have her attention.

Changing pip/distribute-easy_install defaults to require an option for installing packages coming from link rel-types of "download" or "homepage" might make sense as well.

In the end, however, none of this prevents MITM attacks between a downloader and pypi.python.org. Or between the uploader and pypi.python.org (often using basic auth over http). Signing methods like https://wiki.archlinux.org/index.php/Pacman-key are key. If a signature is available (also at a download_url site), then we can exclude undetected tampering. And there might not be a need to break currently working package releases.

It certainly makes sense to fortify python packaging and installation procedures, but i'd like a bit more of a systematic view on it, including reviews from security-focused people and a somewhat incremental, verified approach to turn it real and used.

best,
holger
Re: [Catalog-sig] Use user-specific site-packages by default?
On Tue, Feb 05, 2013 at 15:46 +0100, Giovanni Bajo wrote:
> On Feb 5, 2013, at 15:06, Holger Krekel <holger.kre...@gmail.com> wrote:
>> In the end, however, none of this prevents MITM attacks between a
>> downloader and pypi.python.org. Or between the uploader and
>> pypi.python.org (often using basic auth over http). Signing methods
>> like https://wiki.archlinux.org/index.php/Pacman-key are key. If a
>> signature is available (also at a download_url site), then we can
>> exclude undetected tampering. And there might not be a need to break
>> currently working package releases.
>
> A signature is not enough; if you don't have a secure channel,
> signatures can be replayed. E.g.: if you install through an insecure
> channel and you just verify GPG signatures on the package, I can MITM
> you and serve you an older, vulnerable package version (with its
> correct signature), and then go exploit that vulnerability.

Point taken. I guess unless someone sits down and writes a PEP-ish path for fortification, it's gonna be hard to assess viability and resilience against the several attack vectors, which should be sorted/prioritized. Or is somebody on that already? (There were hints of some background discussions; not sure that's helping much, as most attack vectors against the python packaging ecosystem are kind of well known or easy to guess after a bit of research and experimentation.)

best,
holger
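Giovanni's replay/rollback point can be made concrete with a small sketch: a correct hash (or signature) only proves a file is *some* genuine release, not the *latest* one, so a verifier also needs a freshness constraint such as a minimum acceptable version. All names and values below are illustrative, and a real design (e.g. TUF) does considerably more:

```python
import hashlib

def check_release(data, expected_sha256, version, min_version):
    """Hash check plus rollback protection.  A valid digest alone would
    accept an old, vulnerable release replayed by a MITM; the version
    floor (compared as tuples, e.g. (0, 9)) rejects it."""
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        return "rejected: tampered"
    if version < min_version:
        return "rejected: replayed old release"
    return "ok"

old = b"contents of pkg-0.2.tar.gz"   # genuine but outdated release
print(check_release(old, hashlib.sha256(old).hexdigest(), (0, 2), (0, 9)))
# -> rejected: replayed old release
```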
Re: [Catalog-sig] Use user-specific site-packages by default?
On Tue, Feb 05, 2013 at 16:07 +0100, Lennart Regebro wrote:
> On Tue, Feb 5, 2013 at 3:06 PM, Holger Krekel <holger.kre...@gmail.com> wrote:
>> I wouldn't assume that maintainers are easily reachable. I've
>> contacted at least three people of different (1K downloads) packages
>> who never responded.
>
> We really can't do very much about abandoned packages.
>
>> And of course, i didn't mean to imply that already installed packages
>> would suddenly break. Rather that installation instructions like "use
>> pip install X" will just fail, with some dependency Y not getting
>> installed. Or getting installed in some random lower version which
>> might contain evil bugs (including security bugs). For example, the
>> referenced lockfile project has a 0.2 release on pypi, but is
>> currently at 0.9.
>
> There is no way around that problem, except other people than the
> maintainers uploading the software to PyPI. That's certainly an option,
> and I have no good argument against it, but I don't like it.
> (Obviously it can only be done for software marked with relevant
> licenses.)
>
>> In the end, however, none of this prevents MITM attacks between a
>> downloader and pypi.python.org.
>
> Sure, and that's another problem, and the low-hanging fruit there is
> using https.

Transporting almost all externally reachable packages to be locally pypi-served is also kind of a low-hanging fruit, although probably slightly higher-hanging than SSL :) The point is that we can have some control over those packages once we have them, so we can delete them if they are reported to be malicious, independently of maintainer reachability.

>> If a signature is available (also at a download_url site), then we can
>> exclude undetected tampering.
>
> If they can change the file at the download_url site, then they surely
> can change the signature?

No, because a signature can only be created by the original author for a particular file (his upload), not by the download site or a MITM attacker for a different file.
best,
holger
Re: [Catalog-sig] Use user-specific site-packages by default?
On Tue, Feb 05, 2013 at 10:18 -0500, Donald Stufft wrote:
> On Tuesday, February 5, 2013 at 10:14 AM, holger krekel wrote:
>> Transporting almost all externally reachable packages to be locally
>> pypi-served is also kind of a low-hanging fruit, although probably
>> slightly higher-hanging than SSL :) The point is that we can have some
>> control over those packages once we have them, so we can delete them
>> if they are reported to be malicious, independently of maintainer
>> reachability.
>
> We have no way to validate the package we are downloading is the
> accurate one, we should not infer trust/validation that doesn't exist.

MITM-attacking any of the many world-wide pip/easy_install downloads from external sites is much easier than tampering with a few one-time downloads (verified against each other) for pypi.python.org's serving purposes. By contrast, changing client-side tools and defaults is going to take much longer and will not reach everybody. IOW, i believe that improving the serving side is good low-hanging fruit.

>> No, because a signature can only be created by the original author for
>> a particular file (his upload), not by the download site or a MITM
>> attacker for a different file.
>
> This assumes we know what the correct key is. If we don't, then we have
> no way to validate that the signature was created by the author and not
> by someone else. Trust is hard.

Sure, you need sig-validation infrastructure for this. And sig-validation is a much higher-hanging fruit than using https on pypi.python.org.

best,
holger
Re: [Catalog-sig] Use user-specific site-packages by default?
On Tue, Feb 05, 2013 at 15:54 -0500, Terry Reedy wrote:
> On 2/5/2013 11:35 AM, Lennart Regebro wrote:
>> On Tue, Feb 5, 2013 at 5:03 PM, Donald Stufft <donald.stu...@gmail.com> wrote:
>>> Besides the issues with validating that the package we are mirroring
>>> is the authentic one, there's also a legal issue. We don't know for
>>> sure that we have the legal rights to redistribute those files. When
>>> you upload a file to PyPI you grant the PSF a license to do that; no
>>> upload from the author = no license. IANAL but i think i'm correct on
>>> that.
>>
>> Absolutely, but if the package is marked with a license that allows
>> redistribution in the metadata, then we can.
>
> The last I read (and I cannot find the seemingly hidden page), the
> author (or rights-holder) of code must grant the PSF something more
> than just redistribution rights before uploading it. The same must
> also certify some mumbo-jumbo about compliance with national laws and
> cryptography. No 3rd party can do that.

Not sure i understand. Are you referring to a procedure that is in place already, or one that should be in place?

I consider the activity of caching 3rd party packages that are offered through PyPI's metadata, and which can be downloaded freely from everywhere, as similar to what web caches like squid do. A quick scan produced this sentence from http://en.wikipedia.org/wiki/Web_cache :

    In 1998, the DMCA added rules to the United States Code (17 U.S.C. § 512)
    that relinquishes system operators from copyright liability for the
    purposes of caching.

best,
holger
Re: [Catalog-sig] RubyGems Threat Model and Requirements
On Tue, Feb 12, 2013 at 12:44 -0500, Daniel Holth wrote:
> On Tue, Feb 12, 2013 at 11:27 AM, Giovanni Bajo <ra...@develer.com> wrote:
>> Your Task #6/#7 (related to PyPI generating the trust file, and pip
>> verifying it) are the ones where I think the input of the TUF team
>> will be most valuable, as well as potentially the folks responding to
>> the rubygems.org attack. My understanding is that #6/#7 are not
>> currently covered by TUF. So yes, I would surely value their input to
>> review my design, evolve it together, or scratch it and come up with
>> something new.
>>
>> Sorry for the repetition, but I also volunteer for implementation. I
>> don't mind if someone else does it (or a subset of it, or we split,
>> etc.), but I think it is important to say that this is not a
>> theoretical proposal that someone else will have to tackle; I'm happy
>> to submit patches (all of them, in the worst case) to the respective
>> maintainers and rework them until they are acceptable.
>>
>> The rubygems.org folks will also be looking at server-side incident
>> response - I suspect a lot of that side of things will end up running
>> through the PSF infrastructure team more so than catalog-sig (although
>> it may end up here if it involves PyPI code changes). While I do have
>> some ideas, I don't think I'm fully qualified for that side of things.
>> Primarily, my proposal helps by not forcing PyPI to handle an online
>> master signing key with all the required efforts (migration, rotation,
>> mirroring, threat responses, mitigations, etc.). If you read it, you
>> will have seen that PyPI is only required to validate signatures (like
>> pip), not sign anything.
>
> The alternative is to just use a system implemented by several PhD
> [candidates?] in 2010 based on years of update system experience,
> before pypi security was cool. A doc from last week is a hard sell.

For one, not all PhDs follow clean implementation and automated testing principles. Secondly, I appreciate Giovanni's input, work, and his offer to help with implementation. Let's not be too quick to dismiss it.
On a funny side note, he is the only one with a successfully openssl-verified email in these security-related email threads, just saying :)

best,
holger
Re: [Catalog-sig] HTTPS now promoted on PyPI
On Tue, Feb 19, 2013 at 14:23 +0100, Giovanni Bajo wrote:
> On Feb 19, 2013, at 06:13, Richard Jones <r1chardj0...@gmail.com> wrote:
>> Hi all,
>> I've just altered the nginx configuration to promote (ie. redirect to)
>> HTTPS for all GET/HEAD requests. This includes HSTS, but I've set the
>> lifetime to 1 day just in case there are some HTTPS compatibility
>> issues. Once it's bedded down I'll bump it to a year.
>
> What are the benefits of redirects? I think they just hide potential
> problems, and they can still be exploited by MITM through
> ssl-stripping. Plus, they cause breakage and/or UX problems in existing
> tools. Given that they give basically no security, I would suggest
> their removal until we fix all important issues in all third-party
> tools. For browsers, since you can still serve HSTS headers even
> without redirects, we can get it included in Chrome's and Firefox's
> builtin HSTS lists.
>
>> 2. incorporate some monkey-patching into distribute and setuptools and
>> promote those,
>
> I think this is our best bet for an immediate and global solution for
> outdated versions of Python as well. I will work to prepare a distutils
> patch that is compatible with 2.6 (which includes SSL), and then adapt
> it for 2.7 and 3.x. Do we have numbers on how many 2.5-compatible
> packages have been updated in the last 6 months?

FYI i did a number of py25-compatible releases of projects in the last 6 months, but i generally upload the dist files from higher python versions, so no patch for 2.5 is needed (or 2.6 for that matter).

best,
holger
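The "lifetime" Richard mentions is carried in the Strict-Transport-Security header's max-age directive. A small sketch of reading it back (not a full RFC 6797 parser; header values are illustrative):

```python
def hsts_max_age(header_value):
    """Extract max-age (in seconds) from a Strict-Transport-Security
    header value; returns None if the directive is absent."""
    for directive in header_value.split(";"):
        directive = directive.strip()
        if directive.lower().startswith("max-age="):
            return int(directive.split("=", 1)[1])
    return None

print(hsts_max_age("max-age=86400; includeSubDomains"))   # 86400, the 1-day trial lifetime
print(hsts_max_age("max-age=31536000"))                   # 31536000, one year
```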
Re: [Catalog-sig] Deprecate External Links
On Wed, Feb 27, 2013 at 14:49 -0500, Monty Taylor wrote:
> On 02/27/2013 02:47 PM, Aaron Meurer wrote:
>> On Wed, Feb 27, 2013 at 11:37 AM, holger krekel <hol...@merlinux.eu> wrote:
>>> On Wed, Feb 27, 2013 at 19:34 +0100, Lennart Regebro wrote:
>>>> On Wed, Feb 27, 2013 at 5:34 PM, M.-A. Lemburg <m...@egenix.com> wrote:
>>>>> I'm not saying that it's not a good idea to host packages on PyPI,
>>>>> but forcing the community into doing this is not a good idea.
>>>>
>>>> I still don't understand why not. The only reasons I've seen are
>>>> "because they don't want to" or "because they don't trust PyPI". And
>>>> in the latter case I'm assuming they wouldn't use PyPI at all. And
>>>> of course, nobody is forcing anyone, just like nobody is forcing you
>>>> to use PyPI. :-)
>>>
>>> I understood there is the idea to disable external links within a
>>> couple of months. That does break backward compatibility in a
>>> considerable way.
>>>
>>> holger
>>
>> But wouldn't this only be a change in pip/easy_install, not PyPI
>> itself? I suppose you could explicitly break the external links by
>> having them point to nothing if you are worried about the security, or
>> if it's some performance issue (that would indeed be a bad
>> compatibility break, in case people are using those for other
>> purposes). Otherwise, if it's a problem, then just use the old version
>> of pip.
>
> If we don't remove the feature from pypi itself, then it won't help the
> folks for whom it's a problem, because there will be no incentive for
> the folks hosting their software that way to actually upload their
> stuff to PyPI - which means that client-side disabling of
> external_links is fairly likely to never be usable.

I can see it's tempting to just try to force everyone to upload their stuff to pypi.python.org. I am very skeptical about this approach. There already is high frustration with the packaging ecology in Python. When we remove external links on the server side, installs for many people and companies are going to break, no matter what. And they would have no client-side switch anymore to make things work.
Requiring the use of older setuptools/pip versions would not help because the server information is gone. That'd just increase frustration. So at the very least, using external links needs to be a client-side installer choice for a long while, and the server needs to offer the according information.

I'd generally prefer to think hard about ways to improve the situation without breaking things. Putting simple/ and package serving on a CDN is one such step, and a good idea i think. Establishing a signing/verification mechanism is another. Refining py2/py3 dependency discovery is yet another good one.

best,
holger
Re: [Catalog-sig] Deprecate External Links
On Wed, Feb 27, 2013 at 22:04 +0100, Lennart Regebro wrote:
> On Wed, Feb 27, 2013 at 8:49 PM, Monty Taylor <mord...@inaugust.com> wrote:
>> [...]
>> If we don't remove the feature from pypi itself
>
> It isn't a feature of PyPI. PyPI doesn't require you to upload the
> files to PyPI. For that reason, easy_install and pip will scrape
> external sites to be able to download the files. What we should do is
> agree that this should stop, add a deprecation warning to pip and
> easy_install, and after some pre-determined time remove the feature
> from easy_install and pip.

I suggest to *change defaults* rather than to remove the feature for the foreseeable future. Changing defaults is a powerful way to communicate, and one that doesn't leave people totally stranded who are far removed from the discussions and rationales here.

>> ... then it won't help the folks for whom it's a problem, because
>> there will be no incentive for the folks hosting their software that
>> way to actually upload their stuff to PyPI
>
> Yes there will be: everyone mailing them to tell them their software is
> broken and can't be installed with easy_install and pip. That's going
> to be very annoying very fast. ;-)

I've mailed several maintainers of 1K-downloaded projects in the last half year to inquire about status, and not received replies. I wanted to base work on their projects, and of course i refrained from doing that because of the lack of replies. To me that means you can have users mailing maintainers, or screaming at maintainers, or saying bad words about maintainers or projects all you want, but that doesn't mean it's going to be fixed.
To summarize: having pip/easy_install report red warnings and requiring users to pass a --htmlscrape=PROJ1,PROJ2 option or so is a good way to communicate; removing the ability is not, at this point.

best,
holger
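The default-change idea can be sketched as an installer option. The --htmlscrape flag is Holger's strawman name from the mail, not a real pip option (the pip that eventually shipped used different flags, along the lines of --allow-external); scraping is off unless a project is explicitly opted in:

```python
import argparse

def make_parser():
    """Hypothetical installer front-end: external-link scraping is
    disabled by default and re-enabled only for the projects named in
    --htmlscrape, mirroring the proposed change of defaults."""
    p = argparse.ArgumentParser(prog="installer-sketch")
    p.add_argument("--htmlscrape", default="", metavar="PROJ1,PROJ2",
                   help="comma-separated projects for which scraping "
                        "external links is explicitly allowed")
    p.add_argument("requirement")
    return p

args = make_parser().parse_args(["--htmlscrape", "lockfile,foo", "lockfile"])
allowed = set(args.htmlscrape.split(",")) if args.htmlscrape else set()
print("lockfile" in allowed)   # True: opted in, so scraping is permitted
```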
Re: [Catalog-sig] Deprecate External Links
On Thu, Feb 28, 2013 at 09:48 +1100, Richard Jones wrote:
> On 28 February 2013 08:31, PJ Eby <p...@telecommunity.com> wrote:
>> OTOH, I currently make development snapshots of setuptools and other
>> projects available by dumping them in a directory that's used as an
>> external download URL. Replacing that would be a PITA because PyPI
>> only lets you upload and register new releases from distutils' command
>> line. Basically, I'd need to use a download link that pointed to a
>> "latest" URL that redirected to the final download.
>
> Yup, and the down-side of distutils as the tool for talking to PyPI is,
> of course, the horrendous turn-around time trying to add features or
> fix bugs. I've advocated us having the upload/register/whatever
> functionality in a separate tool for a while, but that doesn't seem to
> have gained any traction. Of course, issues around the complexity
> introduced by setup.py make it much harder.

FWIW, three days ago i presented at PyCon Russia a unifying cmdline workflow meta tool which configures and invokes setup.py [...]/pip/easy_install commands. I intend to publish it soon and will also send a link once the video becomes available.

IOW, i fully agree we need to move away from putting things into setup.py/distutils, start going for PEP 426 etc., but WITHOUT breaking things for all the packaging upload/installation processes out there. Hence a meta tool approach, to make it easier for people to gradually move away from current practices.

cheers,
holger

> In the mean time I think Donald's suggestion for supporting development
> pre-releases is reasonable: instead of (please replace with
> easy_install lingo here) `pip install setuptools==setuptools-dev`,
> please `pip install -e http://svn.python.org/projects/sandbox/trunk/setuptools/#egg=setuptools-dev` ?
>
> Richard
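The meta-tool idea, a single front-end fanning workflow actions out to the underlying setup.py/pip invocations, can be sketched as a dispatch table. Action names and command lines below are invented for illustration; they are not the published tool's interface:

```python
import sys

def command_for(action):
    """Map a high-level workflow action onto the command line a meta tool
    would invoke; a real tool would also read configuration and run the
    command via subprocess."""
    table = {
        "register": [sys.executable, "setup.py", "register"],
        "release":  [sys.executable, "setup.py", "sdist", "upload"],
        "install":  ["pip", "install", "."],
    }
    return table[action]

print(command_for("install"))   # ['pip', 'install', '.']
```

The point of such a layer is that the table can later be retargeted (e.g. away from setup.py toward newer tooling) without users having to change their workflow.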
Re: [Catalog-sig] Deprecate External Links
On Thu, Feb 28, 2013 at 06:38 +0100, Andreas Jung wrote: +1 for the proposal The complete discussion on this topic is once again absurd and bizarre. We are discussing the issue with externally hosted packages every year and the situation has not improved. Especially people using buildout encounter issues very regularly with external sites being down - with the result that we cannot install or update our installation. I give a shit about the arguments pulled out every time by package maintainers using PyPI only for listing their packages. I am both annoyed and bothered by these people. I didn't see such positions from package maintainers here. In fact i haven't seen anyone stepping up saying listing packages externally is a great idea. Could you point to those posts? However, I have seen concerns about breaking many people's and companies' processes, and thus thoughts on how to do a good transition. I guess you don't want to communicate to package users the way you do above to package maintainers. best, holger ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] Deprecate External Links
On Thu, Feb 28, 2013 at 16:30 +0100, Lennart Regebro wrote: On Thu, Feb 28, 2013 at 10:43 AM, Lennart Regebro rege...@gmail.com wrote: On Thu, Feb 28, 2013 at 9:28 AM, Nick Coghlan ncogh...@gmail.com wrote: Pissing off the maintainers off packages that currently rely on external hosting by telling them they have to change their release processes if they want to keep releasing software on PyPI and have their users actually be able to download it is *not* a good idea, especially when we're about to ask them to upgrade their build chains for other reasons (including both security and reliability). Who are these people by the way? I can answer that question now. I have a list of 2651 emails of people listed as maintainers or authors of software that doesn't have releases on PyPI. This is a very inclusive list, so it's lists *all* maintainers and authors of *all* versions of a package, if that package has no files on PyPI. And there are duplicate people, of course, although the emails are unique. There are also packages which have some (older) release files on pypi and newer ones outside (e.g. lockfile with 78256 downloads from code.google.com). You didn't include such in your 2651 emails, or did you? holger ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] Deprecate External Links
On Thu, Feb 28, 2013 at 13:56 +0100, Reinout van Rees wrote: On 28-02-13 10:43, holger krekel wrote: On Thu, Feb 28, 2013 at 06:38 +0100, Andreas Jung wrote: I give a shit about the arguments pulled out every time by package maintainers using PyPI only for listing their packages. I am both annoyed and bothered by these people. I didn't see such positions from package maintainers here. In fact i haven't seen anyone stepping up saying listing packages externally is a great idea. Could you point to those posts? The position Andreas probably means is projects that *do* advertise themselves on pypi, but don't put their files there. It has been an accepted practise for 10 years. I have seen that position in this discussion (I have to upload 120 files per release, so I won't do that, for instance). haven't seen that. Some arguments might be valid, but these projects *are*, taken as one group, actively breaking pip and buildout regularly. yes, and it's annoying, fully agreed. So I agree with Andreas. I don't really care about the arguments pulled out every time. Effectively actively breaking pip and buildout is bad, period. I consider it a valid concern that taking homepage/download urls away from pypi's server index is likely to break things for users. I don't see the point of doing that if we can have a better migration path by working on the installers (like is currently ongoing). Let's please not have a black-and-white discussion here; let's try to improve the overall situation, not just a particular aspect in a particular way. holger ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] Deprecate External Links
On Fri, Mar 01, 2013 at 10:02 +0100, Reinout van Rees wrote: On 28-02-13 21:08, holger krekel wrote: I have seen that position in this discussion (I have to upload 120 files per release, so I won't do that, for instance). haven't seen that. Marc-Andre Lemburg said this, which I took to mean 120 uploads per release: However, taking our egenix-mx-base package as example, we have 120 distribution files for every single release. Uploading those to PyPI would not only take long, but also ... Ah ok, thanks. Didn't interpret Marc-Andre's post as claiming that downloads/homepage crawling is a good idea, though. Just that there has been reasons not to upload things which need to be addressed, especially the need for enough storage space. best, holger Reinout -- Reinout van Reeshttp://reinout.vanrees.org/ rein...@vanrees.org http://www.nelen-schuurmans.nl/ If you're not sure what to do, make something. -- Paul Graham ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] Deprecate External Links
On Fri, Mar 01, 2013 at 10:24 +0100, M.-A. Lemburg wrote: On 01.03.2013 10:02, Reinout van Rees wrote: On 28-02-13 21:08, holger krekel wrote: I have seen that position in this discussion (I have to upload 120 files per release, so I won't do that, for instance). haven't seen that. Marc-Andre Lemburg said this, which I took to mean 120 uploads per release: However, taking our egenix-mx-base package as example, we have 120 distribution files for every single release. Uploading those to PyPI would not only take long, but also ... Correct, with a total of over 100MB per release. However, the above quote is slightly incorrect: I did not say I won't do that, just that there are issues with doing this: * It currently takes too long uploading that many files to PyPI. This causes a problem, since in order to start the upload, we have to register the release on PyPI, which tools will then immediately find. However, during the upload time, they won't necessarily find the right files to download and then fail. You can actually skip the register and directly upload, it will create release metadata on the fly. Not sure if it's complete but you can then do a register to update it if needed. best, holger The proposed pull mechanism (see http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal) would work around this problem: tools would simply go to our servers in case they can't find the files on PyPI. * PyPI doesn't allow us to upload two egg files with the same name: we have to provide egg files for UCS2 Python builds and UCS4 Python builds, since easy_install/setuptools/pip don't differentiate between the two variants. 
This is the main reason why we're hosting our own PyPI-style indexes, one for UCS2 and the other for UCS4 builds: https://downloads.egenix.com/python/index/ucs2/ https://downloads.egenix.com/python/index/ucs4/ * I'm not sure whether we want to import our crypto packages to the US, so for a subset of the files, we'd probably continue to use our servers in Germany. Again, with the above proposal, this shouldn't be a problem. * The PyPI terms are a bummer for us, but this can be fixed, I guess. If we can resolve the issues, we'd have no problem having the files mirrored on PyPI. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Mar 01 2013) Python Projects, Consulting and Support ... http://www.egenix.com/ mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ : Try our mxODBC.Connect Python Database Interface for free ! :: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
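Marc-Andre's UCS2/UCS4 split is detectable from within the interpreter, which is how a client could pick between the two eGenix index URLs he lists above. A minimal sketch (my own illustration, not anything from his mail or tooling; note that since Python 3.3 / PEP 393 the narrow/wide distinction is gone and every build reports as wide):

```python
import sys

def unicode_build():
    """Return 'ucs2' on narrow interpreter builds, 'ucs4' on wide ones.

    Narrow (UCS2) builds report sys.maxunicode == 0xFFFF.  Since
    Python 3.3 (PEP 393) all builds behave as wide, so any modern
    interpreter returns 'ucs4'.
    """
    return "ucs2" if sys.maxunicode == 0xFFFF else "ucs4"

# Pick the matching index (URLs taken from the mail above):
index = "https://downloads.egenix.com/python/index/%s/" % unicode_build()
```

This is the kind of dispatch an installer would need to do itself, since easy_install/pip at the time did not distinguish the two egg variants.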
[Catalog-sig] homepage/download metadata cleaning
Hi Richard, all, somewhere deep in the threads i mentioned that i wrote a little cleanpypi.py script which takes a project name as an argument, goes to pypi.python.org and removes all homepage/download metadata entries for this project. This sanitizes/speeds up installation because pip/easy_install don't need to crawl them anymore. I just did this for three of my projects (pytest, tox and py) and it seems to work fine. Now before i release this as a tool, i wonder: Is it a good idea to remove download/homepage entries? Is there any current machine use (other than the dreaded crawling) for the homepage/download_url per-release metadata fields? For humans the homepage link is nicely discoverable if the long-description doesn't mention it prominently. But i think there also is a project url or bugtrack url for a project, so maybe those could be used to reference these important pages? (i am a bit confused on the exact meaning of those urls, btw). Should we maybe stop advertising homepage and download_url and instead see to extend project-url/bugtrackurl to be used and shown nicely? The latter are independent of releases, which i think makes sense - what use are old, probably unreachable/borked homepages anyway? And it's also not too bad having to go once to pypi.python.org to set it; it usually seldom changes. best, holger ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
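For context, the crawling being switched off here is driven by `rel` attributes on the anchors of a project's /simple/ page: installers visit `rel="homepage"`/`rel="download"` links, while plain anchors (extracted from the long_description) are left alone. A small illustrative parser of that distinction (my own sketch on a made-up page snippet, not the cleanpypi.py script itself, which is not included in the mail):

```python
from html.parser import HTMLParser

class RelLinkFinder(HTMLParser):
    """Collect anchor URLs from a PyPI /simple/ page, split by rel attribute."""

    def __init__(self):
        super().__init__()
        self.crawled = []   # rel="homepage"/"download": installers crawl these
        self.plain = []     # no rel: long_description links, not crawled

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        if d.get("rel") in ("homepage", "download"):
            self.crawled.append(d.get("href"))
        else:
            self.plain.append(d.get("href"))

# Hypothetical /simple/ page fragment:
sample = ('<a rel="homepage" href="http://example.org/">home</a>'
          '<a href="http://example.org/docs">docs</a>')
finder = RelLinkFinder()
finder.feed(sample)
# finder.crawled -> ['http://example.org/']; finder.plain -> ['http://example.org/docs']
```

Removing the homepage/download metadata removes exactly the `rel`'d entries, which is why installation no longer blocks on third-party sites afterwards.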
Re: [Catalog-sig] homepage/download metadata cleaning
On Fri, Mar 01, 2013 at 06:09 -0500, Donald Stufft wrote: On Friday, March 1, 2013 at 6:04 AM, M.-A. Lemburg wrote: On 01.03.2013 11:19, holger krekel wrote: Hi Richard, all, somewhere deep in the threads i mentioned i wrote a little cleanpypi.py script which takes a project name as an argument and then goes to pypi.python.org (http://pypi.python.org) and removes all homepage/download metadata entries for this project. This sanitizes/speeds up installation because pip/easy_install don't need to crawl them anymore. I just did this for three of my projects (pytest, tox and py) and it seems to work fine. Does it also clean up the links that PyPI adds to the /simple/ pages by parsing the project description for links ? I think those are far nastier than the homepage and download links, which can be put to some good use to limit the external lookups (see http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal) See e.g. https://pypi.python.org/simple/zc.buildout/ for a good example of the mess this generates... even mailto links get listed, and file:/// links open up the installers to all kinds of nasty things (unless they explicitly protect against following these). pip at least, and I assume the other tools, don't spider those links, but they do consider them for download (e.g. if the link looks installable it will be a candidate for installing, but it won't fetch it and look for more links, like it will for download_url/home_page). I believe that's the way it's structured atm. That's right. Even though the long-description-extracted links look ugly on a simple/PKGNAME page, neither pip nor easy_install does anything with them except if the href ends in #egg=PKGNAME- in which case they are taken as pointing to a development tarball (e.g. at github or bitbucket). AFAIK a link like PKGNAME-VER.tar.gz will not be treated as an installation candidate, just the #egg=PKGNAME one.
best, holger Now before i release this as a tool, i wonder: Is it a good idea to remove download/homepage entries? Is there any current machine use (other than the dreaded crawling) for the homepage/download_url per-release metadata fields? For humans the homepage link is nicely discoverable if the long-description doesn't mention it prominently. But i think there also is a project url or bugtrack url for a project so maybe those could be used to reference these important pages? (i am a bit confused on the exact meaning of those urls, btw). Should we maybe stop advertising homepage and download_url and instead see to extend project-url/bugtrackurl to be used and shown nicely? The latter are independent of releases which i think makes sense - what use are old probably unreachable/borked homepages anyway. And it's also not too bad having to go once to pypi.python.org (http://pypi.python.org) to set it, usually it seldomly changes. I think it would be better to differentiate between showing the fields on the project pages, where they provide useful resources for people, and their use on the /simple/ index pages which are meant for programs to parse. IMO, the homepage and download links on the project pages are indeed very useful for people. On the /simple/ index a homepage link is probably not all that useful (provided a download link is set). The download links serve the purpose of directing tools to the right location, so those do belong on the /simple/ index listings. I'd completely remove the links parsed from the descriptions, since those don't really provide a good basis for crawling (the description is meant for humans to parse, not programs). -- Marc-Andre Lemburg eGenix.com (http://eGenix.com) Professional Python Services directly from the Source (#1, Mar 01 2013) Python Projects, Consulting and Support ... http://www.egenix.com/ mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... 
http://python.egenix.com/ : Try our mxODBC.Connect Python Database Interface for free ! :: eGenix.com (http://eGenix.com) Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ ___ Catalog-SIG mailing list Catalog-SIG@python.org (mailto:Catalog-SIG@python.org) http://mail.python.org/mailman/listinfo/catalog-sig ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
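The `#egg=` convention discussed above - pip/easy_install treat a plain long-description link as a development-tarball candidate only when its URL fragment names the project - can be sketched as a predicate. This is my own illustrative helper; the real matching lives inside the installers and is more involved:

```python
from urllib.parse import urlparse

def egg_fragment_candidate(url, project):
    """True if a plain (no rel) link would still be treated as an
    installation candidate, i.e. it carries an '#egg=PROJECT' or
    '#egg=PROJECT-<suffix>' fragment, per the behavior described above."""
    frag = urlparse(url).fragment
    if not frag.startswith("egg="):
        return False
    name = frag[len("egg="):]
    return name == project or name.startswith(project + "-")

# A tarball link with '#egg=pytest-dev' is a dev candidate for pytest;
# a bare 'pytest-1.0.tar.gz' link from the description is not.
```

This is why a github/bitbucket snapshot link keeps working after cleaning, while ordinary description links never mattered to the installers in the first place.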
Re: [Catalog-sig] PyPI terms
On Fri, Mar 01, 2013 at 15:11 +0100, M.-A. Lemburg wrote: On 01.03.2013 15:02, Jesse Noller wrote: Okie doke. So we can move on to putting up the CDN and deprecating external links for now? I don't think anyone is against putting up a CDN. It should meet the same security requirements we have for the pypi server itself, ie. HTTPS all the way, proper certificates, operated by the PSF, perhaps run on a different domain, and whatever other goodies Donald can come up with ;-) For the external links we need a migration path... that's in the works. See http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal for a proposal that allows migrating away from relying on external hosts in a backwards compatible and secure way. The page doesn't describe the current scraping situation accurately. As mentioned in my last post, pip/easy_install do _not_ visit all links found in simple/PKGNAME. Only the ones with rel=home_page or rel=download. So the proposal effectively says to not visit homepage links by default and use a special format for download ones. The special format i am not sure about - i guess the SHA256 hash there is to make sure the target content is the correct one, right? What about abusing download_url some more and do a multiline-format like this: HASH1 URL-TO-RELEASE-FILE1 HASH2 URL-TO-RELEASE-FILE2 This way we can avoid any additional http-requests on the pip/easy_install client side _and_ allow multiple release files. The simple/PKGNAME metadata would contain all information that is needed (and we could probably introduce a special syntax for #egg github/bitbucket-style tarballs). Those URLs would only be retrieved if the client-side installer determines it needs them because of the user-required version. You wouldn't need to create a special -download.html file then, no additional http requests, and it's easy to create this format without much tool support. 
Can't incorporate this into the wiki right now myself and i'd probably like to structure the page differently. The issue here really is the (future) behaviour of easy_install and pip, not so much distutils or the pypi server (apart from the worthwhile-to-consider idea of pulling/caching things). On a side note i'd rather prefer this to be a github/bitbucket project where i can submit a pull request :) best, holger -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Mar 01 2013) Python Projects, Consulting and Support ... http://www.egenix.com/ mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ : Try our mxODBC.Connect Python Database Interface for free ! :: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
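To make the multiline download_url idea concrete, here is a tiny parser for the proposed "HASH URL-TO-RELEASE-FILE per line" format. The format is only a suggestion from this mail, and the hashes and URLs below are made up:

```python
def parse_download_entries(text):
    """Parse the proposed multi-line download_url value, one
    'HASH URL-TO-RELEASE-FILE' pair per line, into (hash, url) tuples."""
    entries = []
    for line in text.strip().splitlines():
        digest, url = line.split(None, 1)   # split on first whitespace run
        entries.append((digest, url.strip()))
    return entries

# Hypothetical field value:
example = """
deadbeef https://example.org/pkg-1.0.tar.gz
cafebabe https://example.org/pkg-1.0.zip
"""
# parse_download_entries(example) ->
#   [('deadbeef', 'https://example.org/pkg-1.0.tar.gz'),
#    ('cafebabe', 'https://example.org/pkg-1.0.zip')]
```

An installer could then pick release files without extra HTTP requests and verify the fetched file against the listed hash (e.g. via hashlib), which is the point holger makes above.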
Re: [Catalog-sig] homepage/download metadata cleaning
On Fri, Mar 01, 2013 at 23:50 +0100, Lennart Regebro wrote: On Fri, Mar 1, 2013 at 8:31 PM, M.-A. Lemburg m...@egenix.com wrote: Hmm, then why not remove links that don't match the above from the /simple/ index pages ? I think we can do that, but if we *start* with that, we will just suddenly, with no warning, break everything. It's better if the installation tools can first warn, then remove their support for this, and *then* we remove these links from /simple/. I think Marc-Andre was just referring to the superfluous links from the long-description, namely all links which don't match the #egg format and don't have a rel of download/homepage. Phillip clarified that pypi served all long-description links at the time to leave it to the tools to interpret them. The interpretation is now pretty clear, and so pypi doesn't need to provide them. Removing those unused long-description links should break neither pip nor easy_install. That way we break things gradually, with warnings so that package managers can react and adapt. I generally agree with this strategy but would add that we should also consider the life of system admins or other package installers who may not be able to get maintainers to make new releases. For me this mainly means to aim for changing defaults in pip and easy_install but not to remove crawling abilities completely for the time being. best, holger ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] Fw: Deprecate External Links
On Tue, Mar 05, 2013 at 04:19 -0500, Donald Stufft wrote: Forwarding this since I assume it was accidentally sent to only me, and it's important to note that there is some sort of miscounting bug going on. Forwarded message: From: Donald Stufft donald.stu...@gmail.com To: M.-A. Lemburg m...@egenix.com Date: Tuesday, March 5, 2013 4:16:53 AM Subject: Re: [Catalog-sig] Deprecate External Links On Tuesday, March 5, 2013 at 4:12 AM, M.-A. Lemburg wrote: Perhaps I'm misunderstanding, but if the list contains packages that: * are installable via pip * are not hosted on PyPI then why isn't e.g. egenix-mx-base included in that list ? Unsure, must be a bug in the script. I saw some BadStatusLine errors during the processing but I just assumed they were issues with the server pip was trying to fetch from. I'll see if I can't sort out the reason that egenix-mx-base doesn't show up. FYI lockfile is also not in your list, and it only had lockfile-0.2 on PyPI; the rest up to 0.9.1 is all at code.google (latest is lockfile-0.9.1.tar.gz). best, holger ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] hash tags
Hi Philip, all, On Fri, Mar 08, 2013 at 14:16 -0500, PJ Eby wrote: The key to making this transition isn't creating elaborate new standards for the tools, it's *creating new tools for the standards*. If we can find a way to improve PyPI and not require the world to change first, that's a big plus in my book as well. Point is, this entire thing can be done correctly at the PyPI end and work with the existing API of the download tools. I think so as well. Will suggest a transition model in a new top-level thread, trying to follow this idea. best, holger ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
[Catalog-sig] pre-PEP: transition to release-file hosting at pypi site
Hi Donald, Richard, Nick, Philip, Marc-Andre, all, after some more thinking i wrote a simplified PEP draft for transitioning hosting of release files to pypi.python.org. A PEP is warranted IMO because the according changes will affect all python package maintainers and the Python packaging ecology in general. See the current draft (pre-submit-v1) further below in this mail. I also created a bitbucket repository, see PEP-PYPI-DRAFT.txt at https://bitbucket.org/hpk42/pep-pypi/src Donald, i'd be happy if you join as a co-author and contribute your statistics script and possibly more implementation stuff (PRs to pypi software etc.). Philip, Marc-Andre, Richard (Jones), Nick and catalog-sig/distutils-sig: scrutiny and feedback welcome. Nick: if you could collect feedback on the PEP (draft) around the packaging and distribution mini-summit at Pycon US (15th March), that'd be very useful. Richard: I may ask you to become BDFL-delegate for this PEP especially since you will need to integrate any resulting changes :) I'd like to formally submit this PEP soon but not before i got some feedback. I am not subscribed to distutils-sig and i think distutils is not much affected, but it probably still would help if someone cross-posts there (please put me in CC). cheers, holger PEP-draft: transition to release file hosting at pypi.python.org = Status --- PRE-SUBMIT-v1 Abstract This PEP proposes to move hosting of all release files to pypi.python.org itself. To ease transition and minimize client-side friction, **no changes to distutils or installers** are required. Rather, the transition is implemented through changes to the pypi.python.org implementation and by interactions with package maintainers. Problem --- Today, python package installers (pip and easy_install) need to query multiple sites to discover release files. 
Apart from querying pypi.python.org's simple index pages, an installer also needs to crawl all homepages and download pages ever specified with any release of a package. This need for installers to crawl 3rd-party sites slows down installation and makes for a brittle, unreliable installation process. As of March 2013, about 10% of packages have release files which are not hosted directly on pypi.python.org but rather at places referenced by download/homepage sites. Conversely, roughly 90% of packages are hosted directly on pypi.python.org [1]_. Even for them, installers still need to crawl the homepage(s) of a package. Many package uploaders, in particular, are not aware that specifying the homepage will slow down the installation process. Solution --- Each package is going to get a hosting mode field which affects all historic and future releases of a package and its release files. The field has these values and meanings: - pypi-ext (transitional) encodes exactly the current mode of operation: homepage/download urls are presented in simple/ pages and client-side tools need to crawl them themselves to find release file links. - pypi-cache: Release files located on remote sites will be downloaded and cached by pypi.python.org by crawling homepage/download metadata sites. The resulting simple index contains links to release files hosted by pypi.python.org. The original homepage/download links are added as links without a ``rel`` attribute if they have the ``#egg`` format. - pypi-only: homepage/download links are served on simple indexes but without a ``rel`` attribute. Installation tools will thus not crawl those pages anymore. Use this option if you commit to always uploading your release files to pypi.python.org. Phases of transition - 1. At the outset, we set the hosting mode to pypi-ext for all packages. This will not change any link served via the simple index and thus no bad effects are expected.
Early adopters and testers may now change the mode to either pypi-only or pypi-cache to help with streamlining issues. After implementation and UI issues are streamlined, the next phase can start. 2. We perform automatic analysis for each package to determine if it is a package with externally hosted release files. Packages which only have release files on pypi.python.org are put in group A; those which have at least some release files hosted externally are put in group B. We then send a mail to all maintainers of packages in A that their hosting mode is going to be switched automatically to pypi-only after N weeks, unless they visit their package administration page and set it to either pypi-cache or pypi-only earlier. We then send a mail to all maintainers of packages in B that their hosting mode is going to be switched automatically to pypi-cache after N weeks, unless they visit their package administration
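The three hosting modes sketched in the draft above can be summarized as a small dispatch. This is my own sketch of the draft's intent; the function name and return shape are hypothetical, not PyPI code:

```python
def simple_index_links(mode, pypi_files, external_urls):
    """Return (file_links, crawl_links) for a project's /simple/ page
    under each proposed hosting mode.  crawl_links are the rel'd
    homepage/download URLs an installer would still have to visit."""
    if mode == "pypi-ext":
        # status quo: installers must also crawl the external sites
        return list(pypi_files), list(external_urls)
    if mode == "pypi-cache":
        # external files have been pulled onto PyPI; nothing left to crawl
        cached = ["cached/" + u.rsplit("/", 1)[-1] for u in external_urls]
        return list(pypi_files) + cached, []
    if mode == "pypi-only":
        # external links may still be shown, but without rel, so not crawled
        return list(pypi_files), []
    raise ValueError("unknown hosting mode: %r" % mode)
```

The key property the draft aims for: under pypi-cache and pypi-only the crawl list is empty, so installation never blocks on third-party hosts.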
Re: [Catalog-sig] pre-PEP: transition to release-file hosting at pypi site
On Sun, Mar 10, 2013 at 13:35 -0400, Donald Stufft wrote: On Mar 10, 2013, at 11:07 AM, holger krekel hol...@merlinux.eu wrote: [...] Transitioning to pypi-cache mode - When transitioning from the currently implicit pypi-ext mode to pypi-cache for a given package, a package maintainer should be able to retrieve/verify the historic release files which will be cached from pypi.python.org. The UI should present this list and have the maintainer accept it for completing the transition to the pypi-cache mode. Upon future release registration actions, pypi.python.org will perform crawling for the homepage/download sites and cache release files *before* returning a success return code for the release registration. [...] Some concerns: 1. We cannot automatically switch people to pypi-cache. We _have_ to get explicit permission from them. Could you detail how you arrive at this conclusion? (I've seen the claim before but not the underlying reasoning, maybe i just missed it) There would be prior notifications to the package maintainers. If they don't want to have their packages cached at pypi.python.org, they can set the mode to pypi-only and leave manual instructions. I suspect there will be very few people, if anyone, objecting to pypi-cache mode. If that is false we might need to prolong pypi-ext mode some more for them and eventually switch them to pypi-only when we eventually decide to get rid of external hosting. 2. The cache mechanism is going to be fragile, and in the long term leaves a window open for security issues. fragility: not sure it's too bad. Once the mode is activated, release registration (submit POST action on the /pypi http endpoint) will only succeed if the corresponding releases can be found through homepage/download. Changing the mode to pypi-cache in the presence of historic release files hosted elsewhere needs a good pypi.python.org UI interaction and may take several tries if necessary sites cannot be reached.
Nevertheless, this step is potentially fragile [X]. Security: the PEP does not try to prevent package tampering. MITM attacks between pypi.python.org and the download sites may occur as much as they can happen today between installers and the download sites. I think we should consider protection against package tampering in a separate discussion/PEP. If we're going to do a phased in per project solution like this I think it would work much better to have 2 modes. 1. Legacy - Current behavior, new external links are accepted, existing ones are displayed 2. PyPI Only - New behavior, no new external links are accepted, existing ones are removed Present the project owners with 2 one way buttons: - Switch to PyPI Only and re-host external files [1] Doesn't this have the same fragility problem as [X] above? - Switch to PyPI Only and do NOT re-host external files Are there any problems for doing this automatically (with a prior notification to maintainers) for all the projects where we don't find externally hosted packages? I'd expect very few false negatives and they can be quickly switched back. Back to pypi-cache: it is there to make it super-easy for package maintainers. There are all kinds of release habits and scripts pushing out things to google/bitbucket/github/other sites. With pypi-cache they don't need to change any of that. They just need to be fine with pypi.python.org pulling in the packages for caching. We might think about phasing out pypi-cache after some larger time frame so that we eventually only have pypi-only and things are eventually simple and saner. best, holger These buttons would be one time and quit. Once your project has been switched to PyPI Only you cannot go back to Legacy mode. All new projects would be already switched to PyPI Only. After some amount of time switch all Projects to PyPI Only but _do not_ re-host their packages as we cannot legally do so without their permission. 
The above is simpler, still provides people an easy migration path, moves us to remove external hosting, and doesn't entangle us with legal issues. [1] There is still a small window here where someone could MITM PyPI fetching these files, however since it would be a one time and down deal this risk is minimal and is worth it to move to an pypi only solution. - Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] pre-PEP: transition to release-file hosting at pypi site
On Sun, Mar 10, 2013 at 14:29 -0400, Donald Stufft wrote: On Mar 10, 2013, at 2:18 PM, holger krekel hol...@merlinux.eu wrote: On Sun, Mar 10, 2013 at 13:35 -0400, Donald Stufft wrote: On Mar 10, 2013, at 11:07 AM, holger krekel hol...@merlinux.eu wrote: [...] Transitioning to pypi-cache mode - When transitioning from the currently implicit pypi-ext mode to pypi-cache for a given package, a package maintainer should be able to retrieve/verify the historic release files which will be cached from pypi.python.org. The UI should present this list and have the maintainer accept it for completing the transition to the pypi-cache mode. Upon future release registration actions, pypi.python.org will perform crawling for the homepage/download sites and cache release files *before* returning a success return code for the release registration. [...] Some concerns: 1. We cannot automatically switch people to pypi-cache. We _have_ to get explicit permission from them. Could you detail how you arrive at this conclusion? (I've seen the claim before but not the underlying reasoning, maybe i just missed it) There would be prior notifications to the package maintainers. If they don't want to have their packages cached at pypi.python.org, they can set the mode to pypi-only and leave manual instructions. I suspect there will be very few people if anyone, objecting to pypi-cache mode. If that is false we might need to prolong pypi-ext mode some more for them and eventually switch them to pypi-only when we eventually decide to get rid of external hosting. I asked VanL. His statement on re-hosting packages was: We could do it if we had permission. The tricky part would be getting permission for already-existing packages. I'm pretty sure that emailing someone and assuming we have permission if they don't opt-out doesn't count as permission. Hum, i I saw Jesse Noller saying a few days ago let them opt out. 
But i guess VanL can trump that :) If that is true we could change the notification to maintainers of B packages that hosting mode is going to change to pypi-only, which would lose their release files unless they opt-in to pypi-cache. As long as that is a no-brainer for them, we are not asking for much and can count on most people's good will to not make other people's installation life harder. Besides, admins could still set the pypi-ext mode if a maintainer can explain why it's a problem for them to agree to pypi-cache or pypi-only. I'd really like to not have too many packages lingering around in pypi-ext mode if it can be avoided. 2. The cache mechanism is going to be fragile, and in the long term leaves a window open for security issues. fragility: not sure it's too bad. Once the mode is activated, release registration (submit POST action on /pypi http endpoint) will only succeed if the corresponding releases can be found through homepage/download. Changing the mode to pypi-cache in the presence of historic release files hosted elsewhere needs a good pypi.python.org UI interaction and may take several tries if necessary sites cannot be reached. Nevertheless, this step is potentially fragile [X]. I see, so pypi-cache would only be triggered once during release creation. Cache makes it sound like we'd continuously monitor the given external urls instead of it actually being a pull based method of getting files. Right, we need to avoid cache invalidation problems by only allowing updates at user-chosen points in time (there might also be an explicit update cache button in case a maintainer pushes an egg/wheel later). It's still technically a cache i think but the term rehost would work as well i guess. [...] Back to pypi-cache: it is there to make it super-easy for package maintainers. There are all kinds of release habits and scripts pushing out things to google/bitbucket/github/other sites. With pypi-cache they don't need to change any of that.
They just need to be fine with pypi.python.org pulling in the packages for caching. Yes I understand the goal here. The problem is that there's not really a good way to secure this without requiring changes to their workflow. At best they'll have to push information about every file so that PyPI is able to verify the files it is downloading, and if we are requiring them to push data about those files we might as well require them to push the files themselves. Is this about protection against package tampering? If so, I think a proper solution involves maintainers signing their release files but this is outside the intended scope of the PEP. Otherwise, the re-hosting process for pypi-cache mode is at least as secure as currently where all hosts issuing pip/easy_install commands visit external sites and can thus be MITM-attacked. For pypi-only server packages it's safer because
Re: [Catalog-sig] pre-PEP: transition to release-file hosting at pypi site
Hi Philip, thanks for your helpful review, almost all makes sense to me ... some more inlined comments below. Up front, i am open to you co-authoring the PEP if you like and share the goal to find a minimum viable approach to speed up and simplify the interactions for installers. On Sun, Mar 10, 2013 at 15:41 -0400, PJ Eby wrote: On Sun, Mar 10, 2013 at 11:07 AM, holger krekel hol...@merlinux.eu wrote: Philip, Marc-Andre, Richard (Jones), Nick and catalog-sig/distutils-sig: scrutiny and feedback welcome. Hi Holger. I'm having some difficulty interpreting your proposal because it is leaving out some things, and in other places contradicting what I know of how the tools work. It is also a bit at odds with itself in some places. Certainly, it was a quick draft to get the process going and useful feedback which worked so far :) For instance, at the beginning, the PEP states its proposed solution is to host all release files on PyPI, but then the problem section describes the problems that arise from crawling external pages: problems that can be solved without actually hosting the files on PyPI. To me, it needs a clearer explanation of why the actual hosting part also needs to be on PyPI, not just the links. In the threads to date, people have argued about uptime, security, etc., and these points are not covered by the PEP or even really touched on for the most part. Makes sense to clarify this more. (Actually, thinking about that makes me wonder Donald: did your analysis collect any stats on *where* those externally hosted files were hosted? My intuition says that the bulk of the files (by *file count*) will come from a handful of highly-available domains, i.e. sourceforge, github, that sort of thing, with actual self-hosting being relatively rare *for the files themselves*, vs. a much wider range of domains for the homepage/download URLs (especially because those change from one release to the next.) 
If that's true, then most complaints about availability are being caused by crawling multiple not-highly-available HTML pages, *not* by the downloading of the actual files. If my intuition about the distribution is wrong, OTOH, it would provide a stronger argument for moving the files themselves to PyPI as well.) Digression aside, this is one of the things that needs to be clearer so that there's a better explanation for package authors as to why they're being asked to change. And although the base argument is good (specifying the homepage will slow down the installation process), it could be amplified further with an example of some project that has had multiple homepages over its lifetime, listing all the URLs that currently must be crawled before an installer can be sure it has found all available versions, platforms, and formats of that project. Right, an example makes sense. Okay, on to the Solution section. Again, your stated problem is to fix crawling, but the solution is all about file hosting. Regardless of which of these three hosting modes is selected, it remains an option for the developer to host files elsewhere, and provide the links in their description... unless of course you intended to rule that out and forgot to mention it. (Or, I suppose, if you did *not* intend to rule it out and intentionally omitted mention of that so the rabid anti-externalists would think you were on their side and not create further controversy... in which case I've now spoiled things. Darn. ;-) ) To be honest, while drafting i forgot about the fact that the long_description can contain package links as well. Some technical details are also either incorrect or confusing. For example, you state that The original homepage/download links are added as links without a ``rel`` attribute if they have the ``#egg`` format. But if they are added without a rel attribute, it doesn't *matter* whether they have an #egg marker or not.
It is quite possible for a PyPI package to have a download_url of, say, http://sourceforge.net/download/someproject-1.2.tgz. Right. I just wanted to clarify that the distutils metadata download_url can contain an #egg format link and that this link should still be served (without a rel). Thus, I would suggest simply stating that changing hosting mode does not actually remove any links from the /simple index, it just removes the rel= attributes from the Home page and Download links, thus preventing them from being crawled in search of additional file links. That's certainly a better description of what effectively happens and avoids the special mention of #egg. With that out of the way, that brings me to the larger scope issue with the modes as presented. Notice now that with this clarification, there is no real difference in *state* between the pypi-cache and pypi-only modes. There is only a *functional* difference... and that function is underspecified in the PEP. Agreed. What I mean
Re: [Catalog-sig] pre-PEP: transition to release-file hosting at pypi site
Hi again, A correction on one point of my last mail to you, On Mon, Mar 11, 2013 at 10:02 +, holger krekel wrote: My suggestion would be to do two things: First, make the state a boolean: crawl external links, with the current state yes and the future state no, with no simply meaning that the rel= attribute is removed from the links that currently have it. Second, propose to offer tools in the PyPI interface (and command line) to assist authors in making the transition, rather than proposing a completely unspecified caching mechanism. Better to have some vaguely specified tools than a completely unspecified caching mechanism, and better still to spell out very precisely what those tools do. This structure makes sense to me except that i see the need to start off with pypi-ext, i.e. a third state which encodes the current behaviour. Wait, your suggestion of a boolean crawl external set to yes would encode the current behaviour, so my objection is invalid. Thing is that pypi.python.org doesn't have an extensive test suite and we will thus need to rely on a few early adopters using the tools/state-changes before starting phase 2 (mass mailings etc.). Also in case of problems we can always switch packages back to the safe pypi-ext mode. IOW, the motivation for this third state comes from considering the actual implementation process. This can also be done with your two-state suggestion (switching back to crawl=yes). So no disagreement on that either. best, holger ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] A 90% Solution
On Mon, Mar 11, 2013 at 19:04 -0400, PJ Eby wrote: Just a thought, but... If 90% of PyPI projects do not have any external files to download, then, wouldn't it make sense to: sidenote: we need to verify and clarify the 90/10 ratio. It would be the basis for action/changing pypi-state so we need to have this accurate and double-checked. 1. Add a project-level option to enable or disable the adding of the rel= attribute to /simple links (but not affecting the links in any other way) 2. Default it to disabled for new projects, and 3. Set it to disabled *now* for the 90% of projects that *don't have external files*? If the arguments about banning external links are as valid and important as some people claim, wouldn't it make sense to do this part *now*, without first requiring a commitment to force the switch to a disabled state in the future? Pre-announcing the step to maintainers is good communication style. There is always the issue of bugs in your determination of external hosting or tools that rely on rel attributes without us knowing etc. Immediately, 90% of the problem goes away - no random spidering of stuff that doesn't contain a link now, but which could be taken over by a malicious party in the future, and 90% fewer sites having to be up in order for you to build something from PyPI. Seems like a serious win to me -- and one that might not even need a PEP. Yes and no: a PEP-like document is a good place to point people to. Next steps after this would be providing tools to help people move their files and links, promoting that people switch it off if they no longer support the offsite links, educating about security concerns, etc. I really don't understand why the 90% solution isn't *already* the consensus position, since it doesn't preclude follow-on efforts towards reducing the 10% towards 0%. And if the problem is so important, why must we keep 90% of the problems in place, just so we can keep arguing about censoring the 10%? 
That doesn't make sense to me. The idea of changing only the pypi-server side evolved just last week - so we are not that slow in moving on here :) cheers, holger To me, if somebody's injured, the first thing you do is clean and close the wound, not argue about whether it's a complete solution and what might happen days or weeks later. Just a thought. ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
[Catalog-sig] V2 pre-PEP: transitioning to release file hosting on PYPI
, performing a Man-in-The-Middle (MITM) attack between an installation site and any of the download sites can inject malicious packages on the installation site. As many homepages and download locations use HTTP and not proper HTTPS, such attacks are not very hard to launch. Such MITM attacks can happen even for packages which never intended to host files externally as their homepages are contacted by installers anyway. There is currently no way for package maintainers to avoid 3rd party crawling, other than removing all homepage/download url metadata for all historic releases. While a script [3]_ has been written to perform this action, it is not a good general solution because it removes semantic information like the homepage specification from PYPI packages. Solution --- The proposed solution consists of the following implementation and communication steps: - determine which packages have release files only on PYPI (group A) and which have externally hosted release files (group B). - Prepare the PYPI implementation to allow a per-project hosting mode, effectively enabling or disabling external crawling. When enabled, nothing changes from the current situation of producing ``rel=download`` and ``rel=homepage`` attributed links on ``simple/`` pages, causing installers to crawl those sites. When disabled, the attributions of links will change to ``rel=newdownload`` and ``rel=newhomepage``, causing installers to avoid crawling 3rd party sites. Retaining the meta-information allows tools to still make use of the semantic information. - send mail to maintainers of A that their project is going to be automatically configured to disable crawling in one week and encourage them to set this mode earlier to help all of their users.
- send mail to maintainers of B that their package hosting mode is crawling enabled, list the sites which are currently crawled, and suggest that they re-host their packages directly on PYPI and then switch the hosting-mode to disable crawling. Provide instructions and, ideally, tools to help with this re-uploading process. In addition, maintainers of installation tools are asked to release two updates. The first one shall provide clear warnings if external crawling needs to happen, for exactly which projects and URLs this happens, and that in the future crawling will be disabled by default. The next update shall change the default to disallow crawling and allow crawling only with an explicit option like ``--crawl-externals`` and another option to limit which hosts may be crawled at all. Hosting-Mode state transitions -- 1. At the outset, we set hosting-mode to notset for all packages. This will not change any link served via the simple index and thus no bad effects are expected. Early adopters and testers may now change the mode to either crawl or nocrawl to help with streamlining issues in the PYPI implementation. 2. When maintainers of B packages are mailed, their mode is directly set to crawl. 3. When maintainers of A are mailed, we leave the mode at notset to allow people to change it to nocrawl themselves or to set it to crawl if they think they are wrongly in the A group. After a week all notset modes are set to nocrawl. A week after the mailings all packages will be in crawl or nocrawl hosting mode. It is then a matter of good tools and reaching out to maintainers of B packages to increase the A/B ratio. Open questions -- - Should the support tools for rehosting packages be implemented on the server side or on the client side? Implementing it on the client side is probably quicker to get right and less fatal in terms of failures.
- double-check if ``rel=newhomepage`` and ``rel=newdownload`` cause the desired behaviour of pip and easy_install (both the distribute and setuptools based ones) to not crawl those pages. - are the support tools for re-hosting outside the scope of this PEP? - Think some more about pip/easy_install allow-hosts mode etc. References .. [1] Donald Stufft, ratio of externally hosted versus pypi-hosted, http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html .. [2] Marc-Andre Lemburg, reasons for external hosting, http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html .. [3] Holger Krekel, Script to remove homepage/download metadata for all releases, http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html Acknowledgments -- Philip Eby for precise information and the basic ideas to implement the transition via server-side changes only. Donald Stufft for pushing away from external hosting, doing the 90/10 % statistics script and offering to implement a PR. Marc-Andre Lemburg, Nick Coghlan and catalog-sig for thinking through issues regarding getting rid of external hosting. Copyright
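The crawling toggle the draft describes boils down to which ``rel`` values an installer spiders on a ``simple/`` page. A minimal sketch of that behaviour (hypothetical illustration, not the actual pip/setuptools or PYPI code; the sample page and helper names are invented):

```python
from html.parser import HTMLParser

# rel values that installers currently spider for additional file links;
# links without these rel values are served but never crawled.
CRAWL_RELS = {"homepage", "download"}

class SimplePageLinks(HTMLParser):
    """Collect (href, rel) pairs from a /simple index page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            self.links.append((d.get("href"), d.get("rel")))

def crawl_targets(links):
    # Renaming the attribute (e.g. rel="newhomepage") or dropping it
    # keeps the semantic link on the page but stops installers crawling it.
    return [href for href, rel in links if rel in CRAWL_RELS]

page = (
    '<a href="pkg-1.0.tar.gz#md5=abc">pkg-1.0.tar.gz</a>'
    '<a rel="homepage" href="http://example.org/pkg">home</a>'
    '<a rel="newhomepage" href="http://example.org/pkg">home</a>'
)
parser = SimplePageLinks()
parser.feed(page)
crawled = crawl_targets(parser.links)
# crawled == ["http://example.org/pkg"]: only the rel="homepage" link
```

Note how all three links remain served; only the ``rel=homepage`` one would trigger a crawl, which is exactly the property the notset/crawl/nocrawl transition exploits.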
Re: [Catalog-sig] V2 pre-PEP: transitioning to release file hosting on PYPI
On Wed, Mar 13, 2013 at 01:19 +1000, Nick Coghlan wrote: That looks pretty good to me. My only comment is that qualifiers like new don't age well in an API. The explicit nocrawlhomepage and nocrawldownload might be a better choice. Right, we might also consider dropping the rel attribution entirely, given that you can indeed access release metadata via the xmlrpc or json API. best, holger Cheers, Nick. ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] V2 pre-PEP: transitioning to release file hosting on PYPI
On Tue, Mar 12, 2013 at 11:53 -0400, PJ Eby wrote: On Tue, Mar 12, 2013 at 7:38 AM, holger krekel hol...@merlinux.eu wrote: In addition, maintainers of installation tools are asked to release two updates. The first one shall provide clear warnings if external crawling needs to happen, A clarification here: needs to happen is not well-specified. An installer tasked with finding the latest or best-matching version of a package must currently *always* crawl. So the warning would be always. Not after the initial automatic PYPI transition. For the 90% of the packages you wouldn't see the warning then. The strategy I originally chose for making this change in easy_install is to warn once at the beginning that --allow-hosts has not been set, and thus packages might be downloaded from anywhere on the internet. From a UI perspective i'd like to see a summary of actually consulted but non-specified websites (including if it was http or https) at the very end of an installers output. With non-specified i mean sites that weren't specified as an indexserver or allow-host. I've since become uncertain that this change is actually workable in the short term, since until most of the packages are actually moved onto PyPI, a lot of installs will fail if somebody changes their configuration to be more secure. So I'm thinking the warning needs to be deferred until at least the more popular packages have moved to PyPI. I think it's fine to wait until after the initial hosting-mode transition. Now, if there is some agreement, i can submit this PEP officially tomorrow, and given agreement/refinments from the Pycon folks and the likes of Richard, we may be able to get going very shortly after Pycon. I'd like to suggest that the PEP should be explicit that no other changes to the /simple generation algorithm are being made, just the removal or alteration of rel= attributes. 
i.e., it will still be possible -- at least in the near term -- for projects to include explicit download links to files made available elsewhere. Changing that situation is more controversial and will require wider community participation than has occurred to date. I kind of agree. To transition forward, we should leave out the question of further modifying the simple/ pages at the moment. Mentioning that this means you can put http://PKGNAME-VER.tar.gz in your PKGNAME long_description or download_url metadata makes sense. For that, the installers will give warnings, however, and eventually change defaults according to the PEP draft. It might also be good to suggest that authors of PyPI clones plan their own phase-out of rel= attributes. Most alternative servers i've seen don't use the rel attribution but it's good to mention it. best, holger ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] V2 pre-PEP: transitioning to release file hosting on PYPI
Hi Marc-Andre, all, On Tue, Mar 12, 2013 at 17:06 +0100, M.-A. Lemburg wrote: On 12.03.2013 12:38, holger krekel wrote: Hi all, below is the new PEP pre-submit version (V2) which incorporates the latest suggestions and aims at a rapidly deployable solution. Thanks in particular to Philip, Donald and Marc-Andre. I also added a few notes on how installers should behave with respect to non-PYPI crawling. I think a PEP-like doc is warranted and that we should not silently change things without proper communication to maintainers and pre-planning the implementation/change process. Arguably, the changes are more invasive than oh, let's just do an http-https redirect, which didn't work too well either. Now, if there is some agreement, i can submit this PEP officially tomorrow, and given agreement/refinements from the Pycon folks and the likes of Richard, we may be able to get going very shortly after Pycon. cheers, holger PEP-draft: transitioning to release-file hosting on PYPI Status --- PRE-SUBMIT-v2 Abstract This PEP proposes a backward-compatible transition process to speed up, simplify and robustify installing from the pypi.python.org (PYPI) package index. The initial transition will automatically put most packages on PYPI in a configuration mode which will prevent client-side crawling from installers. To ease automatic transition and minimize client-side friction, **no changes to distutils or installation tools** are required. Instead, the transition is implemented by modifying PYPI to serve links from ``simple/`` pages in a configurable way, preventing or allowing crawling of non-PYPI sites for detecting release files. Maintainers of all PYPI packages will be notified ahead of those changes. Maintainers of packages which are currently hosted on non-PYPI sites shall receive instructions and tools to ease re-hosting of their historic and future package release files. The implementation of such tools is NOT required for implementing the initial automatic transition.
Installation tools like pip and easy_install shall warn about crawling non-PYPI sites and later default to disallow it, allowing it only with an explicit option. History and motivations for external hosting When PYPI went online, it offered release registration but had no facility to host release files itself. When hosting was added, no automated downloading tool existed yet. When Philip Eby implemented automated downloading (through setuptools), he made the choice to allow people to use download hosts of their choice. This was implemented by the PYPI ``simple/`` index containing links of type ``rel=homepage`` or ``rel=download`` which are crawled by installation tools to discover package links. As of March 2013, a substantial part of packages (estimated at about 10%) make use of this mechanism to host files on github, bitbucket, sourceforge or their own hosting sites like ``mercurial.selenic.com``, to just name a few. There are many reasons [2]_ why people choose to use external hosting, to cite just a few: - release processes and scripts have been developed already and upload to external sites - it takes too long to upload large files from some places in the world - export restrictions e.g. for crypto-related software - company policies which prescribe offering open source packages through their own sites - problems with integrating uploading to PYPI into one's release process (because of release policies) - perceived bad reliability of PYPI - missing knowledge that you can upload files Irrespective of the present-day validity of these reasons, there clearly is a history why people chose to host files externally, and for some time it was even the only way to do things. Problem --- **Today, python package installers (pip and easy_install) often need to query non-PYPI sites even if there are no externally hosted files**.
Apart from querying pypi.python.org's simple index pages, installers also crawl all homepages and download pages ever specified with any release of a package. The need for installers to crawl 3rd party sites slows down installation and makes for a brittle, unreliable installation process. Those sites and packages also don't take part in the :pep:`381` mirroring infrastructure, further decreasing reliability and speed of automated installation processes around the world. Roughly 90% of packages are hosted directly on pypi.python.org [1]_. Even for those, installers still need to crawl the homepage(s) of a package. In particular, many package uploaders are not aware that specifying the homepage in their release
Re: [Catalog-sig] V2 pre-PEP: transitioning to release file hosting on PYPI
Hi Carl, On Tue, Mar 12, 2013 at 10:48 -0600, Carl Meyer wrote: Hi Holger, I am confused about the discrepancy between the title of this pre-PEP (transition to release file hosting on PyPI) and the contents of the PEP, which describe a transition to not crawling _HTML pages_ on external sites looking for distribution download links. These are not the same thing at all. I agree the title is not quite right at the moment. Current installer tools will only crawl external HTML pages if they are rel=download or rel=homepage, but they will use any link they find in the simple index (regardless of rel attr) if the target of the link appears to be a distribution file (as determined by filename pattern-matching or #egg fragment). Right. At the end of the process you describe, if all packages migrate to nocrawl, the rel-link HTML spidering will no longer happen. This is a good first step: it will speed up installation somewhat, and reduce the frustration of some package owners when installers find files linked from their project homepage that they never intended for automated installation. But installers will still find and download release packages that are not hosted on PyPI, if those package files are linked directly in the simple index. This is still surprising behavior to many new Python users, and still carries the security and reliability concerns that this PEP claims to address. Yes, and here the installers should move to give clear warnings and change defaults. I'm honestly not sure whether the title or the content more accurately reflects the intent of this PEP; depending which it is, I suggest one of the following: 1) Add to the PEP a description of a further step in the migration process, which actually does transition away from automated installation of non-PyPI-hosted release files (as the default behavior of installation tools); or This makes sense to me. 
Do you feel like opening a pull request on https://bitbucket.org/hpk42/pep-pypi to help refine this aspect? I am also on IRC for co-ordination (also about the title) as i intend to create the PEP submission for python-ideas and maybe already the pep-editors (?!). In any case, it wouldn't mean the PEP's discussion is finalized, of course, and i'd continue to post here new versions and ask for feedback. cheers, holger 2) Change the title of the PEP to something like Transitioning away from non-PyPI HTML crawling and add a paragraph to the PEP clarifying that this PEP does not address the issue of actual release files hosted off-PyPI. Carl ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
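Carl's point above -- that installers use any simple-index link whose target looks like a distribution file, by filename pattern or ``#egg`` fragment, regardless of rel attribute -- can be sketched roughly as follows (a hypothetical helper; the real pattern-matching in pip/setuptools is more involved):

```python
from urllib.parse import urlparse

# Archive extensions an installer might treat as distribution files
# (illustrative subset, not the exact list any real tool uses).
DIST_EXTS = (".tar.gz", ".tgz", ".tar.bz2", ".zip", ".egg")

def looks_like_dist(url):
    """Would an installer treat this link target as a release file?"""
    parsed = urlparse(url)
    if parsed.fragment.startswith("egg="):
        return True  # development-snapshot style link
    return parsed.path.endswith(DIST_EXTS)

# A plain file link is used even without any rel attribute:
assert looks_like_dist("http://example.org/dl/someproject-1.2.tgz")
# ... as is a VCS snapshot link carrying an #egg fragment:
assert looks_like_dist("http://hg.example.org/pkg/get/tip#egg=pkg-dev")
# ... while a bare homepage link is only followed when rel= says to crawl:
assert not looks_like_dist("http://example.org/pkg/")
```

This is why removing rel= attributes alone does not stop off-PyPI downloads: direct file links in the index still match the heuristic.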
Re: [Catalog-sig] pre-PEP: transition to release-file hosting at pypi site
On Tue, Mar 12, 2013 at 13:18 -0400, PJ Eby wrote: On Tue, Mar 12, 2013 at 12:29 PM, Jacob Kaplan-Moss ja...@jacobian.org wrote: On Tue, Mar 12, 2013 at 11:19 AM, M.-A. Lemburg m...@egenix.com wrote: So let's do this carefully and find a good solution before jumping to conclusions. Completely agreed; rushing is a bad idea. But so is not starting. What I'm seeing — as a total outsider, a user of these tools, not someone who creates them — is that a bunch of people (Holger, Donald, Richard, the pip maintainers, etc.) have the beginnings of a solution ready to go *right now*, and I want to capture that energy and enthusiasm before it evaporates. This isn't an academic situation; I've seen companies decline to adopt Python over this exact security issue. Nobody told them how to configure a restricted, site-wide default --allow-hosts setting? ( http://peak.telecommunity.com/DevCenter/EasyInstall#restricting-downloads-with-allow-hosts and http://docs.python.org/2/install/index.html#location-and-names-of-config-files ) (FWIW, --allow-hosts was added in setuptools 0.6a6 -- *years* before the distribute fork or the existence of pip, and pip offers the same option.) I've already agreed to change setuptools to default this option to only allow downloads from the same host as its index URL, in a future release (i.e. to default --allow-hosts to the host of the --index-url option), and I support removing rel= spidering from PyPI (which will significantly mitigate the immediate speed and security issues). Heck, I've been the one who's repeatedly proposed various ways of cutting back or removing rel= attributes from the /simple index. The result of these two changes will actually have the same net effect that people are asking for here: you'll only be able to download stuff hosted on PyPI, unless you explicitly override the --allow-hosts to get a wider range of packages.
Already today, when a URL is blocked by --allow-hosts, it's announced as part of easy_install's output, so you can see exactly how much wider you need to extend your trust for the download to succeed. The *only* thing I object to is removing the ability for people to *choose* their own levels of trust. And I have not yet seen an argument that justifies removing people's ability to *choose* to be more inclusive in their downloads. And I've put multiple compromise proposals out there to begin mitigating the problem *now* (i.e. for non-updated versions of setuptools), and every time, the objection is, no, we need to ban it all now, no discussion, no re-evaluation, no personal choice, everyone must do as we say, no argument. FWIW, the PEP draft in V2 doesn't take this approach and i don't plan to introduce it in subsequent versions. IOW, i agree that we should keep things backward-compatible in the sense that users can choose to use non-default settings to get the current behaviour (which might make their installation process less reliable/secure, but that's their choice). cheers, holger And I don't understand that, at all. ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
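For reference, the site-wide restriction PJ mentions can be set in a distutils configuration file (e.g. ``~/.pydistutils.cfg``), per the EasyInstall documentation he links; the host pattern below is just an example value:

```ini
[easy_install]
allow_hosts = *.python.org
```

With such a setting in place, easy_install refuses downloads from any host not matching the pattern and reports the blocked URLs in its output.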
Re: [Catalog-sig] pre-PEP: transition to release-file hosting at pypi site
On Tue, Mar 12, 2013 at 12:18 -0600, Carl Meyer wrote: It seems to me that there's a remarkable level of consensus developing here (though it may not look like it), and a small set of remaining open questions. The consensus (as I see it): - Migrate away from scraping external HTML pages, with package owners in control of the migration but a deadline for a forced switch, as outlined in Holger's PEP (with all appropriate caution and testing). - In some way, migrate to a situation where the popular installer tools install only release files from PyPI by default, but are capable of installing from other locations if the user provides an option. The open question is basically how to implement the latter portion. I see two options proposed: A) Leave external links in the PyPI simple index, but migrate the major tools to not use external links by default (i.e. Philip's plan to make allow-hosts=pypi the default in a future setuptools), with an option to turn them back on. or B) Do a second PyPI migration, again with a per-package toggle and package owners in control, to a no external links in simple index setting. Consider for a moment how similar the end state here is with either A or B. In either case, by default users install only from PyPI, but by providing a special option they can install from some external source. (In B, that special option would be something like --find-links with a URL). In either case, we can continue to allow packages to register themselves on PyPI, be found in searches, etc, without uploading release files to PyPI if they prefer not to; they'll just have to provide special installation instructions to their users in that case. Here are some differences: 1) With B, we can provide a gentler migration for package owners, where they are in control of when the switch happens. 
With A, regardless of how it's done, at some point some package owners are likely to start getting "hey, I can't install your stuff anymore" reports from users, and they can't control when that starts happening. 2) With B, all end users benefit from the new defaults, not only end users who update to the latest and greatest tools. 3) With B (and probably some forms of A as well), end users clearly state which external sources they would like to trust and install from, rather than having a global "trust everything!" flag, which is less secure and less sensible. It seems to me that option B (a controlled, per-package, PyPI migration to no-external-links in simple index) is a better migration path than A (leaving it up to external tools), and the end result either way is very similar. Thanks for outlining this so well. I agree with the B approach and suggest introducing three per-package hosting states: - pypi-only: only pypi-hosted files and all #egg links are served via simple/ (#egg links are necessary and a special case for installing development snapshots - we should not leave them out, I think) - pypi-nocrawl: all links as of now but without rel-attribution (i.e. all description links are served and also the homepage/download ones, but without rel-attribution) - pypi-crawl: all links as of now The automatic transition of the hosting mode for most packages (with pre-announcements) specified in the PEP will need to differentiate between switching to pypi-only and pypi-nocrawl. And as discussed elsewhere, the implementation of the underlying analysis script and the PyPI changes certainly needs to be ready before the PEP can be finally accepted. I'm open to a corresponding PR to the PEP draft :) holger Carl ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
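The three proposed hosting states can be sketched in a few lines of Python. This is an illustrative model only: `links_for_mode` and the `(href, rel)` link representation are assumptions made for this sketch, not actual PyPI server code.

```python
# Illustrative sketch (not actual PyPI code) of which links a package's
# simple/ page would serve under the three proposed per-package hosting modes.
# Links are modeled as (href, rel) pairs; rel=None means "no rel attribute".

def links_for_mode(mode, pypi_files, egg_links, scraped_links):
    if mode == "pypi-only":
        # only PyPI-hosted files plus the special #egg development-snapshot links
        return pypi_files + egg_links
    if mode == "pypi-nocrawl":
        # all links, but scraped ones lose their rel= attribution,
        # so installers get no signal to crawl homepage/download pages
        return pypi_files + egg_links + [(href, None) for href, _ in scraped_links]
    if mode == "pypi-crawl":
        # current behaviour: everything, rel= attributes intact
        return pypi_files + egg_links + scraped_links
    raise ValueError("unknown hosting mode: %s" % mode)

# Hypothetical example data for one package:
pypi_files = [("https://pypi.python.org/packages/foo-1.0.tar.gz", None)]
egg_links = [("https://example.org/repo#egg=foo-dev", None)]
scraped = [("http://example.org/dl/foo-0.9.tar.gz", "download")]
```

Under `pypi-nocrawl` the scraped links survive, but without `rel=` attribution installers have no cue to crawl the referenced pages, which is exactly the crawling-avoidance the mode is meant to provide.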
Re: [Catalog-sig] V2 pre-PEP: transitioning to release file hosting on PYPI
On Tue, Mar 12, 2013 at 19:07 +0100, M.-A. Lemburg wrote: Just a quick note (more later, if time permits)... On 12.03.2013 18:05, holger krekel wrote: Hi Marc-Andre, all, - Prepare PYPI implementation to allow a per-project hosting mode, effectively enabling or disabling external crawling. When enabled, nothing changes from the current situation of producing ``rel=download`` and ``rel=homepage`` attributed links on ``simple/`` pages, causing installers to crawl those sites. When disabled, the attributions of links will change to ``rel=newdownload`` and ``rel=newhomepage``, causing installers to avoid crawling 3rd-party sites. Retaining the meta-information allows tools to still make use of the semantic information. Please start using versioned APIs for these things. The old-style index should still be available under some URL, e.g. /simple-v1/ or /v1/simple/ or /1/simple/ Not sure it is necessary in this case. I would think it makes the implementation harder and it would probably break PEP 381 (mirroring infrastructure) as well. Here's what I meant: We publish the current implementation of the /simple/ index API under a new URL /simple-v1/, so that people who want to use the old API can continue to do so. Then we set up a new /simple-v2/ index API with your proposed change, perhaps even dropping the rel attribute altogether. During testing, we'd then have:
- /simple/ - same as /simple-v1/
- /simple-v1/ - old API with rel attributes always set
- /simple-v2/ - new API with your changes (rel attributes only set in some cases)
After a month or so of testing, we then switch this to:
- /simple/ - same as /simple-v2/
- /simple-v1/ - old API with rel attributes always set
- /simple-v2/ - new API with your changes (rel attributes only set in some cases)
I understand but am not sure how easy this is to manage at the moment. I'd like to put this up in open questions and have (eventually) Richard comment on this before evolving it further.
best, holger -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Mar 12 2013) Python Projects, Consulting and Support ... http://www.egenix.com/ mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ : Try our mxODBC.Connect Python Database Interface for free ! :: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] pre-PEP: transition to release-file hosting at pypi site
On Tue, Mar 12, 2013 at 14:36 -0500, Jacob Kaplan-Moss wrote: On Tue, Mar 12, 2013 at 2:21 PM, PJ Eby p...@telecommunity.com wrote: The *only* thing I object to is the part where some people want to ban external links from /simple, always and forever, regardless of the package authors' choice in the matter. Here's the thing, though: there are already a bunch of other ways users can install packages from external repositories. I can think of at least two: * I can pip/easy_install a given URL (e.g. easy_install https://www.djangoproject.com/download/1.5/tarball/) * I can use a custom index server (pip install -i http://localserver/ django) The important part is that in each of those cases I can see clearly where I'm getting things from. OTOH, if I do pip install Django I — the person making the install — have no control over where that package comes from. It really violates people's expectations that this reaches out to somewhere that's not-pypi. More importantly, it prevents me from making a security choice -- I literally don't know until the download starts where the file might be coming from. From where I stand, the absolutely non-negotiable part is that `pip/easy_install/whatever package` should NEVER access an external host (after some suitable transition period). This needs to include older installer software, and it needs to make it hard for new tools to do the wrong thing. How this is achieved really doesn't matter to me -- if there's a `pip install --insecure Django` that's fine too -- but to me it's non-negotiable that the out-of-the-box configuration not allow external hosts. Yes, this means taking some options away from the package creator. It means that when I'm wearing my author-of-Django hat I can't choose to list Django on PyPI but provide the download elsewhere. That's not perfect, but given a choice between creator flexibility and out-of-the-box security, the latter has to win.
[And as a package creator I still have options: I can run my own package server, fairly easy to do these days.] Again, the *how* isn't a big deal to me, but the result is really important: the tooling has to be secure-by-default, and that means (among other things) `pip install package` can never hit something that's not PyPI without me explicitly asking for it. Let's be clear, however, that we are at most reducing attack vectors; there are substantial attack vectors left. Nobody should be led to think that PYPI is a trusted or reviewed source of software, even if we got rid of external hosting completely. holger Jacob ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] pre-PEP: transition to release-file hosting at pypi site
On Tue, Mar 12, 2013 at 15:21 -0400, PJ Eby wrote: On Tue, Mar 12, 2013 at 2:18 PM, Carl Meyer c...@oddbird.net wrote: It seems to me that there's a remarkable level of consensus developing here (though it may not look like it), and a small set of remaining open questions. The consensus (as I see it): - Migrate away from scraping external HTML pages, with package owners in control of the migration but a deadline for a forced switch, as outlined in Holger's PEP (with all appropriate caution and testing). - In some way, migrate to a situation where the popular installer tools install only release files from PyPI by default, but are capable of installing from other locations if the user provides an option. Perhaps I'm confused, but ISTM that every time I've said this, Donald and Lennart argue that it should not be possible to provide such an option -- or to be more specific, that PyPI should not publish the information that makes that option possible. If that's *not* the position they're taking, it'd be good to know, because we could totally stop arguing about it in that case. I don't know. At least the pre-PEP doesn't take the position to unconditionally ban external links. Maybe Lennart or Donald can say whether they oppose the moves outlined in the PEP. I'd hope that the perceived perfect doesn't become the enemy of the good here :) A) Leave external links in the PyPI simple index, but migrate the major tools to not use external links by default (i.e. Philip's plan to make allow-hosts=pypi the default in a future setuptools), with an option to turn them back on. I don't know who has proposed this option, but it's not me. You seem to be confusing external links and HTML-scraped links (rel= attributed links in /simple). The suggested behaviour of installers is not yet fully formulated in the PEP. We should improve that. I was the first person to propose disabling HTML-scraped links from PyPI *ASAP*. Yes, and thanks for pushing us in this direction.
I still want them gone. That won't require tool changes, it just requires a rollout plan. Holger has one, let's work on that. The second thing I proposed is that new tools be developed to *assist* package authors in moving their files onto PyPI, so that future tool changes wouldn't result in widespread instances of people needing to set their tools to insecure settings just to get anything done. We need to get people's files moving onto PyPI *first*, in order to make changing the tool defaults practical. Indeed, it's a good idea to require that the re-hosting or transfer tool be ready before installers change their defaults. The *only* thing I object to is the part where some people want to ban external links from /simple, always and forever, regardless of the package authors' choice in the matter. I agree the package author should have a choice about the serving of links for their package. And installers should change defaults so that install-users eventually have a choice as well, to control whether they are fine with crawling or using external links. B) Do a second PyPI migration, again with a per-package toggle and package owners in control, to a "no external links in simple index" setting. Consider for a moment how similar the end state here is with either A or B. In either case, by default users install only from PyPI, but by providing a special option they can install from some external source. (In B, that special option would be something like --find-links with a URL). In either case, we can continue to allow packages to register themselves on PyPI, be found in searches, etc., without uploading release files to PyPI if they prefer not to; they'll just have to provide special installation instructions to their users in that case. Not true: approach B means that you won't know what values to pass to the option.
Yes and no: in the one case you need to specify --crawl or --use-external-links, and in the other --find-links https://... The latter requires reading the homepage or long_description of a package for the correct URL, so it is less obvious to install-users. It's also confused about an important point. All the links that appear in /simple are *already* completely under the package author's control. No new switches are required to remove external links - you can simply remove them from your releases' descriptions. This process could be made more transparent or easy, sure -- but it's a mistake to say that this is granting the package owners control that they don't already have. Right. I think allowing a package maintainer to say "actually, please don't serve external links for my package" (hosting mode pypi-only) is an easy, expressive way to exert this control. What they lack control over is the rel= attributes, short of removing those links entirely. That's why I've proposed having a switch for that, as reflected in Holger's
[Catalog-sig] V3 PEP-draft for transitioning to pypi-hosting of release files
Hi all, after some more discussions and hours spent by Carl Meyer (who is now co-authoring the PEP) and me, here is a new V3 pre-submit draft. It is now more ambitious than the previous draft, as should be obvious from the modified abstract (and Carl Meyer's and Philip's earlier interactions on this list). There also are more details of how the current link-scraping works, among other improvements and incorporations of feedback from discussions here. We intend to submit this draft tonight to the PEP editors. Feedback now and later remains welcome. I am sure there are issues to be sorted and clarified, among them the versioning-API suggestion by Marc-Andre. Thanks for everybody's support and feedback so far, holger PEP: XXX Title: Transitioning to release-file hosting on PyPI Version: $Revision$ Last-Modified: $Date$ Author: Holger Krekel hol...@merlinux.eu, Carl Meyer c...@oddbird.net Discussions-To: catalog-sig@python.org Status: Draft (PRE-submit V3) Type: Process Content-Type: text/x-rst Created: 10-Mar-2013 Post-History: Abstract This PEP proposes a backward-compatible two-phase transition process to speed up, simplify and robustify installing from the pypi.python.org (PyPI) package index. To ease the transition and minimize client-side friction, **no changes to distutils or existing installation tools are required in order to benefit from the transition phases, which is to result in faster, more reliable installs for most existing packages**. The first transition phase implements easy and explicit means for a package maintainer to control which release file links are served to present-day installation tools. The first phase also includes the implementation of analysis tools for present-day packages, to support communication with package maintainers and the automated setting of default modes for controlling release file links. The second transition phase will result in the current PYPI index serving only PYPI-hosted files by default.
Externally hosted files will still be automatically discoverable through a second index. Present-day installation tools will be able to continue working by specifying this second index. New versions of installation tools shall default to only install packages from PYPI unless the user explicitly wishes to include non-PYPI sites. Rationale = .. _history: History and motivations for external hosting When PyPI went online, it offered release registration but had no facility to host release files itself. When hosting was added, no automated downloading tool existed yet. When Philip Eby implemented automated downloading (through setuptools), he made the choice to allow people to use download hosts of their choice. The finding of externally-hosted packages was implemented as follows: #. The PyPI ``simple/`` index for a package contains all links found anywhere in that package's metadata for any release. Links in the Download-URL and Home-page metadata fields are given ``rel=download`` and ``rel=homepage`` attributes, respectively. #. Any of these links whose target is a file whose name appears to be in the form of an installable source or binary distribution, with basename in the form packagename-version.ARCHIVEEXT, is considered a potential installation candidate. #. Similarly, any links suffixed with an #egg=packagename-version fragment are considered an installation candidate. #. Additionally, the ``rel=homepage`` and ``rel=download`` links are followed and, if HTML, are themselves scraped for release-file links in the above formats. Today, most packages released on PyPI host their release files on PyPI, but a small percentage (XXX need updated data) rely on external hosting. There are many reasons [2]_ why people have chosen external hosting. To cite just a few: - release processes and scripts have been developed already and upload to external sites - it takes too long to upload large files from some places in the world - export restrictions e.g.
for crypto-related software - company policies which require offering open source packages through own sites - problems with integrating uploading to PYPI into one's release process (because of release policies) - desiring download statistics different from those maintained by PyPI - perceived bad reliability of PYPI - not being aware that PyPI offers file-hosting Irrespective of the present-day validity of these reasons, there clearly is a history of why people chose to host files externally, and for some time it was even the only way to do things. Problem --- **Today, python package installers (pip, easy_install, buildout, and others) often need to query many non-PyPI URLs even if there are no externally hosted files**. Apart from querying pypi.python.org's simple index pages, all homepages and download pages ever specified with any release of a package are also crawled
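The link-discovery steps described in the history section boil down to a small candidate-link test. A minimal Python sketch, illustrative only: `is_candidate` is a made-up helper name, and real setuptools/pip code handles far more archive types and version-parsing quirks.

```python
from urllib.parse import urlparse

# Sketch of the candidate-link rules described above (illustrative only).
ARCHIVE_EXTS = (".tar.gz", ".tar.bz2", ".tgz", ".zip", ".egg")

def is_candidate(url, project):
    """True if `url` looks like an installable release file for `project`
    (basename of the form packagename-version.ARCHIVEEXT), or carries an
    #egg=packagename-version fragment."""
    parsed = urlparse(url)
    # rule: links with an #egg=packagename-version fragment are candidates
    if parsed.fragment.startswith("egg=%s-" % project):
        return True
    # rule: file names of the form packagename-version.ARCHIVEEXT are candidates
    basename = parsed.path.rsplit("/", 1)[-1]
    return basename.startswith(project + "-") and basename.endswith(ARCHIVE_EXTS)
```

Under these rules a link like https://example.org/dist/foo-1.0.tar.gz is a candidate for project foo, while a plain homepage link is not (though with `rel=homepage` attribution the page behind it would itself be crawled for further candidate links).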
Re: [Catalog-sig] V3 PEP-draft for transitioning to pypi-hosting of release files
On Wed, Mar 13, 2013 at 23:43 -0700, Nick Coghlan wrote: On Wed, Mar 13, 2013 at 5:16 PM, Carl Meyer c...@oddbird.net wrote: There is no instead of. There are parallel proposals (see the TUF thread) to improve the security of the ecosystem, and those proposals are not mutually exclusive with this one. If you search the PEP text, note that you don't find the words secure or security anywhere within it, or any claims of security achieved by this proposal alone. There is a brief mention of MITM attacks, which is relevant to the PEP because avoiding external link-crawling does reduce that attack surface, even if other proposals will also help with that (even more). Right, the changes to provide end-to-end security require more extensive changes and need to be given appropriate consideration before we proceed to implementation and deployment. This PEP, especially with the additional changes you propose here, is an excellent approach to *near term* improvement, as a parallel effort to the more complex proposals. The /simple/ index will also be around for a long time for backwards compatibility reasons, regardless of any other changes that happen in the overall distribution ecosystem. I haven't followed the latest TUF discussions and related docs in depth yet, but if those developments will regard simple/ as a deprecated interface, I think this PEP should maybe not introduce simple/-with-externals, as it will just make the situation more complicated for everyone to understand in a few months from now. best, holger Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
[Catalog-sig] V4 Pre-PEP: transition to release-file hosting on PYPI
Hi all, in particular Philip, Marc-Andre, Donald, Carl and I decided to simplify the PEP and avoid the somewhat awkward ``simple/-with-externals`` index for various reasons, among them Marc-Andre's criticisms. This also means present-day installation tools (shipped with Redhat/Debian/etc.) will continue to work as today for those packages which remain in a hosting-mode that requires crawling and scraping. They will still benefit from the fact that most packages will soon have a hosting-mode that avoids it. Future releases of installation tools will default to not perform crawling or using (scraped) external links, and new PYPI projects will default to only serve uploaded files. The V4 pre-PEP also renames the three PyPI hosting modes to be more descriptive. Since all three modes allow external links, pypi-ext vs pypi-only were misleading. The new naming distinguishes the mode that both scrapes links from metadata and crawls external pages for more links (pypi-scrape-crawl) from the mode that only scrapes links from metadata (pypi-scrape) from the mode where all links are explicit (pypi-explicit). Without the separate external index, it also turns out that the two transition phases are separated into PyPI changes (phase one) and installer-tool updates (phase two). There are no PyPI changes necessary in phase two. As stated in a new open question, it should be possible to do PEP-related installation tool updates during phase 1; that may require a bit of clarification in the PEP's language still. Carl and I are happy with this PEP version now and hope you all are as well. Donald is already working on improving the analysis tool so we hopefully have some updated numbers soon.
cheers, Holger PEP: XXX Title: Transitioning to release-file hosting on PyPI Version: $Revision$ Last-Modified: $Date$ Author: Holger Krekel hol...@merlinux.eu, Carl Meyer c...@oddbird.net Discussions-To: catalog-sig@python.org Status: Draft (PRE-submit V4) Type: Process Content-Type: text/x-rst Created: 10-Mar-2013 Post-History: Abstract This PEP proposes a backward-compatible two-phase transition process to speed up, simplify and robustify installing from the pypi.python.org (PyPI) package index. To ease the transition and minimize client-side friction, **no changes to distutils or existing installation tools are required in order to benefit from the first transition phase, which will result in faster, more reliable installs for most existing packages**. The first transition phase implements an easy and explicit means for a package maintainer to control which release file links are served to present-day installation tools. The first phase also includes the implementation of analysis tools for present-day packages, to support communication with package maintainers and the automated setting of default modes for controlling release file links. The first phase also will make new projects on PYPI use a default to only serve links to release files which were uploaded to PYPI. The second transition phase concerns end-user installation tools, which shall default to only install release files that are hosted on PyPI and tell the user if external release files exist, offering a choice to automatically use those external files. Rationale = .. _history: History and motivations for external hosting When PyPI went online, it offered release registration but had no facility to host release files itself. When hosting was added, no automated downloading tool existed yet. When Philip Eby implemented automated downloading (through setuptools), he made the choice to allow people to use download hosts of their choice. 
The finding of externally-hosted packages was implemented as follows: #. The PyPI ``simple/`` index for a package contains all links found by scraping them from that package's long_description metadata for any release. Links in the Download-URL and Home-page metadata fields are given ``rel=download`` and ``rel=homepage`` attributes, respectively. #. Any of these links whose target is a file whose name appears to be in the form of an installable source or binary distribution, with name in the form packagename-version.ARCHIVEEXT, is considered a potential installation candidate by installation tools. #. Similarly, any links suffixed with an #egg=packagename-version fragment are considered an installation candidate. #. Additionally, the ``rel=homepage`` and ``rel=download`` links are crawled by installation tools and, if HTML, are themselves scraped for release-file links in the above formats. Today, most packages released on PyPI host their release files on PyPI, but a small percentage (XXX need updated data) rely on external hosting. There are many reasons [2]_ why people have chosen external hosting. To cite just a few: - release processes and scripts have been developed already and upload to external sites
Re: [Catalog-sig] V4 Pre-PEP: transition to release-file hosting on PYPI
On Fri, Mar 15, 2013 at 11:15 -0400, PJ Eby wrote: Do we even need the internal/external rel info? I was planning to just use the URL hostname. i.e., are there any use cases for designating an externally-hosted file internal, or an internally-hosted file external? If not, it seems the rel= is redundant. It's also more work to implement, vs. just defaulting --allow-hosts to be the --index-url host; a strategy ISTM pip could also use, since it has the same two options available. Also, if we're not doing homepage/download crawling any more, I was hoping we could just drop the code that 'parses' rel= links in the first place, as it's an awkward ugly hack. ;-) We wanted to avoid requiring hostname-checking especially in light of parallel developments putting PYPI release files on a CDN, i.e. non pypi.python.org domains. The rel=internal communicates that this link is under control of the index server and the installer should not be worried and users need not know about allow-hosts etc. For example, Donald's https://crate.io is already operating in this manner and has its files on crate-cdn.com. best, holger ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
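The two filtering strategies under discussion (trusting the `rel=internal` attribute versus comparing each link's hostname against the index host) can be contrasted with a small sketch. The HTML snippet and the `LinkCollector` class are illustrative assumptions, not PyPI's actual markup or any installer's real code.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Collect (href, rel) pairs from anchor tags on a hypothetical simple/ page.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, rel) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            self.links.append((d.get("href", ""), d.get("rel", "")))

# Made-up example page: one index-controlled link, one external link.
SIMPLE_PAGE = """
<a href="https://pypi.python.org/packages/foo-1.0.tar.gz#md5=abc" rel="internal">foo-1.0.tar.gz</a>
<a href="http://example.com/dl/foo-1.1.tar.gz">foo-1.1.tar.gz</a>
"""

parser = LinkCollector()
parser.feed(SIMPLE_PAGE)

# Strategy A: trust the rel attribute (keeps working if files move to a CDN host)
internal_by_rel = [href for href, rel in parser.links if rel == "internal"]

# Strategy B: trust only the index host (breaks once PyPI serves files from a CDN)
index_host = "pypi.python.org"
internal_by_host = [href for href, rel in parser.links
                    if urlparse(href).hostname == index_host]
```

On this page both strategies agree; the difference appears once index-controlled files move to a CDN hostname, where strategy B would wrongly classify them as external while strategy A keeps working.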
Re: [Catalog-sig] V4 Pre-PEP: transition to release-file hosting on PYPI
On Fri, Mar 15, 2013 at 22:01 -0400, PJ Eby wrote: On Fri, Mar 15, 2013 at 7:16 PM, Carl Meyer c...@oddbird.net wrote: Ok, pending agreement from Holger I'll make a change in the PEP to explicitly allow clients to make decisions based on either the rel attributes or based on hostnames. Would that be sufficient to address your concerns? Yes. I just don't want to be in a situation down the road where there's another argument about this on Catalog-SIG when PyPI starts using a CDN: "but it says this in the rel and you're supposed to use that," and I say, "but Carl and Holger said..." and they go, "doesn't matter, PEP says" ;-) This way, the PEP will be clear that supporting a split of PyPI's hostnames isn't in current scope. I am also okay with the PEP allowing *.indexhost instead of just indexhost as the filtering mechanism, as long as it specifies one *now*. (Again, so this doesn't have to be revisited later.) If somebody who knows something about CDNs, TUF, etc., needs to weigh in on it first, that's fine. I just want to know where things stand. One related question: the rel=internal links will contain a (currently md5) hash, so if the referenced resource resolves to a file matching that hash, we can be sure about its integrity. What kind of security does host-checking add on top? holger Putting the /simple/ API on a CDN isn't quite that easy because it currently involves some server-side redirects to effectively make project names case-insensitive. FWIW, easy_install works fine without this. If a matching index page isn't found, it checks the full package list. PyPI's redirection just reduces bandwidth usage and request overhead in the case where the case of the user's request doesn't match the actual package listing. But it could be completely static without affecting easy_install and tools that use its package-finding code.
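The integrity check holger alludes to can be illustrated with a short sketch: the hash embedded in a release-file link lets an installer verify the downloaded bytes regardless of which host served them. The function name and URLs below are hypothetical, not actual installer code.

```python
import hashlib
from urllib.parse import urlparse

# Sketch of the integrity check enabled by a #md5=<hexdigest> fragment
# on a release-file link (illustrative only).

def check_md5_fragment(url, data):
    """Compare downloaded bytes against the #md5= fragment of `url`.
    Returns True/False, or None when the link carries no hash."""
    fragment = urlparse(url).fragment
    if not fragment.startswith("md5="):
        return None
    expected = fragment[len("md5="):]
    return hashlib.md5(data).hexdigest() == expected

# Made-up example: a link whose fragment matches the payload's digest.
payload = b"example release file contents"
link = ("https://pypi.python.org/packages/foo-1.0.tar.gz#md5="
        + hashlib.md5(payload).hexdigest())
```

Note this guards download integrity against tampering in transit, not the trustworthiness of the publisher, and the "(currently md5)" wording in the thread suggests the hash algorithm was expected to be changeable.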
___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig ___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] Updated PEP 438
Hi Richard, all, On Wed, Mar 20, 2013 at 17:30 -0700, Richard Jones wrote: I've pushed the latest PEP to the repos. It has all the recent clarifications and the API docs. Just need to wait for the website to rebuild or something. It's online now. Current references to PEP438 (also inlined below): http://www.python.org/dev/peps/pep-0438/ https://bitbucket.org/hpk42/pep-pypi/src/c0cbd3f3508991f5c47eb0fdb036c6e25ef45047/PEP-438.txt?at=default Unless there's any last-minute problems I'll accept the PEP in this form and push the implementation to the production PyPI next week after I fly home. testpypi.python.org keeps 502ing on me - probably makes sense to first have that stable and reviewed for a few days at least. best and thanks everybody, holger PEP: 438 Title: Transitioning to release-file hosting on PyPI Version: $Revision$ Last-Modified: $Date$ Author: Holger Krekel hol...@merlinux.eu, Carl Meyer c...@oddbird.net BDFL-Delegate: Richard Jones rich...@python.org Discussions-To: catalog-sig@python.org Status: Draft Type: Process Content-Type: text/x-rst Created: 15-Mar-2013 Post-History: Abstract This PEP proposes a backward-compatible two-phase transition process to speed up, simplify and robustify installing from the pypi.python.org (PyPI) package index. To ease the transition and minimize client-side friction, **no changes to distutils or existing installation tools are required in order to benefit from the first transition phase, which will result in faster, more reliable installs for most existing packages**. The first transition phase implements easy and explicit means for a package maintainer to control which release file links are served to present-day installation tools. The first phase also includes the implementation of analysis tools for present-day packages, to support communication with package maintainers and the automated setting of default modes for controlling release file links. 
The first phase also will default newly-registered projects on PyPI to only serve links to release files which were uploaded to PyPI. The second transition phase concerns end-user installation tools, which shall default to only install release files that are hosted on PyPI and tell the user if external release files exist, offering a choice to automatically use those external files. External release files shall in the future be registered together with a checksum hash so that installation tools can verify the integrity of the eventual download (PyPI-hosted release files always carry such a checksum). Alternative PyPI server implementations should implement the new simple index serving behaviour of transition phase 1 to avoid installation tools treating their release links as external ones in phase 2. Rationale = .. _history: History and motivations for external hosting When PyPI went online, it offered release registration but had no facility to host release files itself. When hosting was added, no automated downloading tool existed yet. When Philip Eby implemented automated downloading (through setuptools), he made the choice to allow people to use download hosts of their choice. The finding of externally-hosted packages was implemented as follows: #. The PyPI ``simple/`` index for a package contains all links found by scraping them from that package's long_description metadata for any release. Links in the Download-URL and Home-page metadata fields are given ``rel=download`` and ``rel=homepage`` attributes, respectively. #. Any of these links whose target is a file whose name appears to be in the form of an installable source or binary distribution, with name in the form packagename-version.ARCHIVEEXT, is considered a potential installation candidate by installation tools. #. Similarly, any links suffixed with an #egg=packagename-version fragment are considered an installation candidate. #. 
Additionally, the ``rel=homepage`` and ``rel=download`` links are crawled by installation tools and, if HTML, are themselves scraped for release-file links in the above formats. See the easy_install documentation for a complete description of this behavior. [1]_ Today, most packages indexed on PyPI host their release files on PyPI. Out of 29,117 total projects on PyPI, only 2,581 (less than 10%) include any links to installable files that are available only off-PyPI. [2]_ There are many reasons [3]_ why people have chosen external hosting. To cite just a few: - release processes and scripts have been developed already and upload to external sites - it takes too long to upload large files from some places in the world - export restrictions e.g. for crypto-related software - company policies which require offering open source packages through own sites - problems with integrating uploading to PyPI into one's release process (because of release policies) - desiring download
Re: [Catalog-sig] Replacement client for pep381client
On Wed, Mar 20, 2013 at 19:27 -0700, Christian Theune wrote:

  On 2013-03-20 23:59:21 +, Christian Theune said:

    I'm currently re-initializing my own mirror. This basically can be run in-place by just removing the existing state data and calling my sync script (bsn-mirror) instead of pep381run with the same parameters.

This worked nicely for me - I'm running my mirror on bandersnatch now. I got 3 errors so far, like this one::

  2013-03-21 14:23:19,759 bandersnatch.package INFO: Downloading: https://pypi.python.org/packages/source/C/Clay/Clay-0.13.tar.gz
  2013-03-21 14:23:20,384 bandersnatch.package ERROR: Error syncing package: Coopr
  Traceback (most recent call last):
    File "/home/hpk/bandersnatch/src/bandersnatch/package.py", line 50, in sync
      self.sync_release_files()
    File "/home/hpk/bandersnatch/src/bandersnatch/package.py", line 68, in sync_release_files
      self.download_file(release_file['url'], release_file['md5_digest'])
    File "/home/hpk/bandersnatch/src/bandersnatch/package.py", line 144, in download_file
      url, existing_hash, md5sum))
  ValueError: https://pypi.python.org/packages/source/C/Coopr/Coopr-1.1.zip has hash 97cb7ae47656df10d243533c4f0c63c1 instead of 7ed6916702b2afccd254b423450ac4af

and the command terminates. I can restart fine, though. Will continue and see how far i get. Seems to perform quickly, btw :)

holger

___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
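The ValueError above comes from a mirror comparing a downloaded file's md5 digest against the digest the index advertised. A minimal sketch of such a check follows; the function name and signature are illustrative, not bandersnatch's actual API.

```python
import hashlib

def check_md5(path, expected_md5, url):
    """Hash a downloaded release file incrementally and raise a
    ValueError like the one above if it does not match the digest
    advertised by the index (illustrative sketch)."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large release files need not fit in memory.
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            md5.update(chunk)
    actual = md5.hexdigest()
    if actual != expected_md5:
        raise ValueError("%s has hash %s instead of %s"
                         % (url, actual, expected_md5))
    return actual
```

A mismatch typically means the file changed on the server after the metadata was published, or the download was corrupted, which is why restarting the sync (as above) can succeed.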
Re: [Catalog-sig] Merge catalog-sig and distutils-sig
On Thu, Mar 28, 2013 at 14:22 -0400, Donald Stufft wrote:

  Is there much point in keeping catalog-sig and distutils-sig separate? It seems to me that most of the same people are on both lists, and the topics almost always have consequences for both sides of the coin. So much so that it's often hard to pick *which* of the two lists (or both) you post to. Further confused by the fact that distutils is hopefully someday going to go away :)

+1

  Not sure if there's some official process for requesting it or not, but I think we should merge the two lists and just make a packaging-sig as an umbrella for all packaging topics.

  - Donald Stufft
  PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig
Re: [Catalog-sig] Merge catalog-sig and distutils-sig
On Thu, Mar 28, 2013 at 15:42 -0400, Donald Stufft wrote:

  On Mar 28, 2013, at 3:39 PM, PJ Eby p...@telecommunity.com wrote:

    On Thu, Mar 28, 2013 at 3:14 PM, Fred Drake f...@fdrake.net wrote:

      On Thu, Mar 28, 2013 at 2:22 PM, Donald Stufft don...@stufft.io wrote:

        Is there much point in keeping catalog-sig and distutils-sig separate?

      No. The last time this was brought up, there were objections, but I don't remember what they were. I'll let people who think there's a point worry about that.

        Not sure if there's some official process for requesting it or not, but I think we should merge the two lists and just make packaging-sig to umbrella the entire packaging topics.

      There is the meta-sig, but the description is outdated: http://mail.python.org/mailman/listinfo/meta-sig and the last message in the archives is dated 2011 and sparked no discussion: http://mail.python.org/pipermail/meta-sig/2011-June.txt

    +1 on merging the lists. Can we do it by just dropping catalog-sig and keeping distutils-sig? I'm afraid we might lose some important distutils-sig population if the process involves renaming the list, resubscribing, etc. I also *really* don't want to invalidate archive links to the distutils-sig archive. All in all, +1 on not having two lists, but I'm really worried about breaking distutils-sig. We're still going to be talking about distribution utilities, after all.

  Don't care how it's done. I don't know Mailman well enough to know what is possible or how easy things are. I thought packaging-sig sounded nice, but if you can't rename + redirect or merge or something in Mailman, I'm down for whatever.

I've moved lists even from external sites to python.org and renamed them (the latest was pytest-dev). That part works nicely, and people can continue to use the old ML address. Merging two lists, however, makes it harder to get redirects for the old archives.
But why not just keep the distutils-sig and catalog-sig archives, have all their mail arrive at a new packaging-sig, and begin a new archive for the latter?

holger

___ Catalog-SIG mailing list Catalog-SIG@python.org http://mail.python.org/mailman/listinfo/catalog-sig