On 5/9/2014 2:12 PM, Donald Stufft wrote:

On May 9, 2014, at 1:28 PM, R. David Murray <rdmur...@bitdance.com> wrote:

I don't understand this.  Why is it our responsibility to provide a
free service for a large project to repeatedly download a set of files
they need?  Why does it not make more sense for them to download them
once, and only update their local copies when they change?  That's almost
completely orthogonal to making the service we do provide reliable.

Well here’s the thing, right. The large projects repeatedly downloading the
same set of files is a canary. If any particular project goes uninstallable on
PyPI (or if PyPI itself goes down) then nobody can install it, whether they’re
people installing things over and over every day or people who just happened
to be installing it during that downtime. However, intermittent failures and
general instability are going to be noticed much sooner by the projects that
install things over and over again, which makes them a lot easier to use as a
general gauge for what the average “uptime” is.

I have had the same question as David, so I also appreciate your answer.

IOW if PyPI goes unavailable for 10 minutes 5 times a day, you might get
a handful of “small” installers (i.e. not the big projects) in each downtime,
but a different set each time, who are likely to shrug it off and just treat it
as the norm even though it’s very disruptive to what they’re doing. However,
the big project is highly likely to hit every single one of those downtimes
and be able to say “wow, PyPI is flaky as hell”.

To expand further on that: if we assume that we want ``pip install <foo>``
to be reliable, rather than working sometimes and failing at other times, then
we’re aiming for as high an uptime as possible. PyPI gets enough traffic that
any single large project isn’t a noticeable drop in the bucket, and due to the
way our caching works it actually helps us be faster and more reliable to have
people constantly hitting packages, because they’ll be in cache and can be
served without hitting the origin servers.
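
To make the caching point a bit more concrete, here is a rough sketch of how
a client that already has a copy can revalidate it cheaply instead of
re-downloading it, using the requests library. The URL is just an example and
it assumes the server returns an ETag; this is an illustration, not how pip or
the CDN actually does it:

    import requests

    # Any resource served through the CDN works the same way; this URL is
    # only an illustration.
    url = "https://pypi.org/simple/pip/"

    first = requests.get(url)
    etag = first.headers.get("ETag")

    if etag:
        # Ask whether our copy is still current instead of re-downloading
        # the whole body; an unchanged resource comes back as 304 with an
        # empty body, assuming the server honors revalidation.
        second = requests.get(url, headers={"If-None-Match": etag})
        print(second.status_code)  # 304 if unchanged, 200 with a fresh body otherwise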

Just for the record, PyPI handles roughly 350 req/s basically 24/7. In the
month of April we served 71.4TB of data across 877.4 million requests, of
which 80.5% never made it to the actual servers that run PyPI and were
served directly out of the geo-distributed CDN that sits in front of PyPI. We
are vastly better positioned to maintain a reliable infrastructure than to ask
every large project that uses Python to do the same.
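
Those monthly totals are consistent with the per-second figure; a quick
back-of-the-envelope check (assuming a 30-day month, variable names just for
illustration):

    # Rough check of the April numbers quoted above.
    requests_total = 877.4e6          # requests served in April
    cdn_hit_rate = 0.805              # share answered at the CDN edge
    seconds_in_april = 30 * 24 * 3600

    print(requests_total / seconds_in_april)            # ~338 req/s on average
    print(requests_total * (1 - cdn_hit_rate) / 1e6)    # ~171 million requests reached origin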

The reason it’s our responsibility to provide it is that we chose to provide
it. There isn’t a moral imperative to run PyPI, but running PyPI
badly seems like a crummy thing to do.

Agreed.

For perspective, Gentoo requests that people only do an emerge sync at
most once a day, and if they have multiple machines to update, that they
only do one pull, and they update the rest of their infrastructure from
their local copy.

To be clear, there are other reasons to run a local mirror, but I don’t think
that it’s reasonable to expect anyone who wants a reliable install using pip
to stand up their own infrastructure.

Ok, you are not saying that caching is bad, but that having everyone
reinvent caching, and possibly doing it badly, or at least not in the best
way, is bad.

Further to this point, I’m currently working on adding caching by default
to pip, so that we minimize how often different people hit PyPI, and so that
it happens automatically, in a way that doesn’t generally require people to
think about it or to stand up their own infra.
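
For a rough idea of what “caching without having to think about it” can look
like on the client side, here is a minimal sketch that wraps requests with the
third-party CacheControl library. The cache directory name and URL are
arbitrary, and this is only an illustrative sketch, not pip’s actual
implementation:

    import requests
    from cachecontrol import CacheControl
    from cachecontrol.caches.file_cache import FileCache

    # Wrap a normal requests session so responses are cached on disk
    # according to the HTTP caching headers the server sends.
    sess = CacheControl(requests.Session(), cache=FileCache(".web_cache"))

    # The first fetch goes over the network; a repeat fetch of the same URL
    # can be answered from the local file cache if the headers allow it.
    resp = sess.get("https://pypi.org/simple/pip/")
    print(resp.status_code)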

This seems like the right solution. It would sort of make each machine a micro-CDN node.


--
Terry Jan Reedy

