On May 27, 2013, at 4:34 PM, Noah Kantrowitz <n...@coderanger.net> wrote:
>
> On May 27, 2013, at 1:20 PM, holger krekel wrote:
>
>> On Mon, May 27, 2013 at 12:58 -0700, Noah Kantrowitz wrote:
>>> On May 27, 2013, at 12:18 PM, holger krekel wrote:
>>>
>>>> On Mon, May 27, 2013 at 14:59 -0400, Donald Stufft wrote:
>>>>> On May 27, 2013, at 2:54 PM, holger krekel <hol...@merlinux.eu> wrote:
>>>>>
>>>>>> On Mon, May 27, 2013 at 13:50 -0400, Donald Stufft wrote:
>>>>>>> On May 27, 2013, at 12:39 PM, Donald Stufft <don...@stufft.io> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On May 27, 2013, at 8:08 AM, holger krekel <hol...@merlinux.eu> wrote:
>>>>>>>>
>>>>>>>>> Hi Noah, Donald, (CC also Richard, Christian),
>>>>>>>>>
>>>>>>>>> i just checked with a test package and think we might have a cache
>>>>>>>>> consistency / changelog API problem. It took me a while but here is
>>>>>>>>> the basic thing: I uploaded a test package, changelog API reports it
>>>>>>>>> has changed, then i go to its simple page, and some of the time the
>>>>>>>>> new release file shows up, sometimes not.
>>>>>>>>>
>>>>>>>>> Tools like bandersnatch, pep381 and devpi-server (and probably others)
>>>>>>>>> use PyPI's changelog API to determine if there are changes. It seems
>>>>>>>>> those changes are signalled faster than they become consistently
>>>>>>>>> accessible through the CDN. This can lead to inconsistent mirrors
>>>>>>>>> because when the CDN has the files there is no change event anymore.
>>>>>>>>> Such mirrors are run by companies in-house so i think it's a real
>>>>>>>>> problem.
>>>>>>>>>
>>>>>>>>> Even without mirroring there can be problems because installs are not
>>>>>>>>> directly repeatable: "pip install XYZ>=2.0" can give you first 2.0.1,
>>>>>>>>> then 2.0.0 a minute later. I had hoped that a particular ip address
>>>>>>>>> sees things consistently.
>>>>>>>>>
>>>>>>>>> I am not familiar with Fastly's caching properties -- can they notify
>>>>>>>>> about the fact that a page/file is consistently up-to-date
>>>>>>>>> everywhere? Or can the cache be globally invalidated for a particular
>>>>>>>>> page/file? Any other ideas?
>>>>>>>>>
>>>>>>>>> Failing customizing Fastly usage and also maybe for the short term,
>>>>>>>>> is/could there be a special location provided by pypi.python.org which
>>>>>>>>> the above tools could use to get at the actual non-cached data? We
>>>>>>>>> could then maybe mitigate the problem through updates of the
>>>>>>>>> respective tools. That would at least solve the problem for one of my
>>>>>>>>> customers i think.
>>>>>>>>>
>>>>>>>>> best,
>>>>>>>>> holger
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, May 26, 2013 at 10:34 -0700, Noah Kantrowitz wrote:
>>>>>>>>>> </farnsworth>
>>>>>>>>>>
>>>>>>>>>> but seriously, at long last today it was my honor to throw the DNS
>>>>>>>>>> switch to move PyPI to the Fastly caching CDN. I would like to thank
>>>>>>>>>> Donald Stufft for doing much of the heavy lifting on the PyPI side,
>>>>>>>>>> and to Fastly for graciously offering to host us. What does this
>>>>>>>>>> mean for everyone? Well the biggest change is PyPI should get a
>>>>>>>>>> whole lot faster. There are two major downsides however. There will
>>>>>>>>>> now be a delay of several minutes in some cases between updating a
>>>>>>>>>> package and having it be installable, and download counts will now
>>>>>>>>>> be even more incorrect than they were before. The PyPI admins are
>>>>>>>>>> discussing what to do about download counts long-term, but for now
>>>>>>>>>> we all feel that the performance and availability benefits outweigh
>>>>>>>>>> the loss. If anyone has any questions, or hears anything about
>>>>>>>>>> issues with PyPI please don't hesitate to contact me.
>>>>>>>>>>
>>>>>>>>>> --Noah
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Distutils-SIG maillist - Distutils-SIG@python.org
>>>>>>>>>> http://mail.python.org/mailman/listinfo/distutils-sig
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Distutils-SIG maillist - Distutils-SIG@python.org
>>>>>>>>> http://mail.python.org/mailman/listinfo/distutils-sig
>>>>>>>>
>>>>>>>> I mentioned it on twitter but might as well mention it here as well.
>>>>>>>>
>>>>>>>> Currently there is no invalidation going on. The effect on the
>>>>>>>> mirroring was unanticipated and I'm currently getting the invalidation
>>>>>>>> API setup within PyPI.
>>>>>>>>
>>>>>>>> -----------------
>>>>>>>> Donald Stufft
>>>>>>>> PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Distutils-SIG maillist - Distutils-SIG@python.org
>>>>>>>> http://mail.python.org/mailman/listinfo/distutils-sig
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> /simple/ Pages should now be immediately invalidated when a new package
>>>>>>> is released.
>>>>>>
>>>>>> thanks Donald. Looking at the implementation, i wonder what happens if
>>>>>> after ``self._conn.commit()`` a changelog API call arrives, returns
>>>>>> changes and a client uses it to retrieve changes before the
>>>>>> fastly-purging takes place. It's still a potential race-condition or
>>>>>> am i missing something?
>>>>>>
>>>>>> best,
>>>>>> holger
>>>>>>
>>>>>>> -----------------
>>>>>>> Donald Stufft
>>>>>>> PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> There's no way around a race condition.
>>>>>
>>>>> ``self._conn.commit()`` is what makes the changes available. If we purge
>>>>> prior to committing it then if someone hits the page between the purge
>>>>> and the self._conn.commit() then the client will see a page cached prior
>>>>> to the update (while the change log will appear to be updated).
>>>>> Essentially the same problem we have now.
>>>>>
>>>>> The current implementation does mean that if a client happens to hit
>>>>> between the commit and the purge they'll see old data however that's
>>>>> pretty unlikely.
>>>>
>>>> Purging can take a second and also depends on the network connectivity
>>>> between pypi.python.org and fastly's api to begin with. I am afraid
>>>> the race-condition is bound to happen and then hard to detect.
>>>>
>>>> Not sure how exactly pypi.python.org is deployed but could commit() use
>>>> a semaphore which also the changelog-APIs use so that the latter only
>>>> returns after purging (and then some) has happened? I don't think
>>>> mirrors would mind sometimes waiting a few seconds before the changelog*
>>>> call returns as long as the state is then consistent.
>>>>
>>>> Lastly, i think introducing a bit of internal syncing overhead to commit()/
>>>> changelog should be ok because we have only few writes and hardly read
>>>> load.
>>>
>>> Mirroring should not be affected by caching at all, as new packages mean
>>> new URLs (/pypi/name/version), so when you retrieve them there will be no
>>> cache issues.
>>
>> The simple/PROJ pages are changed, not newly created. (and yes,
>> new release files are not so much the problem because they are new
>> and thus retrieved from fastly on first access).
>
> Yes, pep381client is fundamentally incompatible with the future of PyPI's
> infrastructure. Sorry, this will not be changed at this point. If people
> would like to continue to operate mirrors, they will need to transition to
> use the API to access package information, fetch updated files, and rebuild
> any relevant index data. For example, this is how Donald's crate.io mirror
> operates. Using the current strategy of scraping the simple/ pages will
> continue to work, you just need to retry failed requests until they succeed
> (and check that the per-project pages match the version you expect from the
> change log, consider it a failure if they do not). This is just a stopgap
> though, and should not be considered a long-term solution.
>
>>
>>> What I think you mean is this makes a race condition for pep381client,
>>> however this is a bug in pep381client, not PyPI. If you would like to
>>> submit a patch for a Paxos-based replication protocol, I'm sure Donald
>>> and I would be happy to review it.
>>
>> I am a bit lost of what you are talking about here.
>>
>> The move to CDN broke things that worked before. The changelog API
>> reported changes that could not be seen afterwards. This remains true
>> after Donald's changes which just make it less likely but not impossible
>> to happen.
>
> They worked effectively by accident, not because it was correct. Had I
> understood how backwards the pep381 systems are, I would have alerted you all
> sooner, I apologize for this lapse. I am happy to talk about how to correctly
> use the PyPI API with anyone that has questions, or discuss more advanced
> replication options that will be free of race conditions in a distributed
> (yes, PyPI is a distributed database now) environment.
>
> --Noah
>

Just to assure folks: I do consider mirroring a first-class citizen and an
important feature. Better support will be coming for it. However, this
relatively minor issue, requiring that clients get a little smarter to use the
current system, is a regrettable requirement to make installation from PyPI
not suck for a large swathe of people.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
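
For mirror authors, the stopgap Noah describes above (retry the simple/ page until it lists the version the change log announced, and treat anything else as a failure) might look roughly like the sketch below. This is only an illustration, not code from PyPI or any existing mirror tool: the function names, the `fetch_page` callable, the retry counts, and the filename heuristic are all assumptions made for the example.

```python
import re
import time


def page_lists_version(page_html, project, version):
    """Return True if a simple/ page links to a file for exactly this version.

    Assumption: release files are named "<project>-<version>" followed by
    an archive suffix (".tar.*", ".zip", ".tgz") or a "-<tag>" part.
    """
    prefix = f"{project}-{version}".lower()
    for href in re.findall(r'href="([^"#]+)', page_html):
        filename = href.rsplit("/", 1)[-1].lower()
        if not filename.startswith(prefix):
            continue
        rest = filename[len(prefix):]
        # Guard against "2.0" matching "2.0.1": the version must end here.
        if rest.startswith((".tar.", ".zip", ".tgz", "-")):
            return True
    return False


def fetch_until_consistent(fetch_page, project, version,
                           attempts=5, delay=1.0, sleep=time.sleep):
    """Call fetch_page() until the page reflects the expected version.

    fetch_page is a hypothetical zero-argument callable returning the
    HTML of the project's simple/ page. Raises RuntimeError if the page
    is still stale after the given number of attempts.
    """
    for attempt in range(attempts):
        page = fetch_page()
        if page_lists_version(page, project, version):
            return page
        if attempt < attempts - 1:
            sleep(delay)  # give the CDN purge time to take effect
    raise RuntimeError(
        f"simple page for {project} never listed version {version}"
    )
```

A mirror would pass in whatever it already uses to download the page, and would record the change-log entry as processed only once `fetch_until_consistent` returns without raising.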