On May 27, 2013, at 4:34 PM, Noah Kantrowitz <n...@coderanger.net> wrote:

> 
> On May 27, 2013, at 1:20 PM, holger krekel wrote:
> 
>> On Mon, May 27, 2013 at 12:58 -0700, Noah Kantrowitz wrote:
>>> On May 27, 2013, at 12:18 PM, holger krekel wrote:
>>> 
>>>> On Mon, May 27, 2013 at 14:59 -0400, Donald Stufft wrote:
>>>>> On May 27, 2013, at 2:54 PM, holger krekel <hol...@merlinux.eu> wrote:
>>>>> 
>>>>>> On Mon, May 27, 2013 at 13:50 -0400, Donald Stufft wrote:
>>>>>>> On May 27, 2013, at 12:39 PM, Donald Stufft <don...@stufft.io> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On May 27, 2013, at 8:08 AM, holger krekel <hol...@merlinux.eu> wrote:
>>>>>>>> 
>>>>>>>>> Hi Noah, Donald, (CC also Richard, Christian),
>>>>>>>>> 
>>>>>>>>> i just checked with a test package and think we might have a cache
>>>>>>>>> consistency / changelog API problem.  It took me a while but here is
>>>>>>>>> the basic thing: I uploaded a test package, changelog API reports it
>>>>>>>>> has changed, then i go to its simple page, and some of the time the
>>>>>>>>> new release file shows up, sometimes not.
>>>>>>>>> 
>>>>>>>>> Tools like bandersnatch, pep381 and devpi-server (and probably others)
>>>>>>>>> use PyPI's changelog API to determine if there are changes.  It seems
>>>>>>>>> those changes are signalled faster than they become consistently
>>>>>>>>> accessible through the CDN.  This can lead to inconsistent mirrors
>>>>>>>>> because once the CDN has the files there is no further change event.
>>>>>>>>> Such mirrors are run by companies in-house so i think it's a real
>>>>>>>>> problem.
>>>>>>>>> 
>>>>>>>>> Even without mirroring there can be problems because installs are not
>>>>>>>>> directly repeatable: "pip install XYZ>=2.0" can give you first 2.0.1,
>>>>>>>>> then 2.0.0 a minute later.  I had hoped that a particular ip address
>>>>>>>>> sees things consistently.
>>>>>>>>> 
>>>>>>>>> I am not familiar with Fastly's caching properties -- can they notify
>>>>>>>>> about the fact that a page/file is consistently up-to-date everywhere?
>>>>>>>>> Or can the cache be globally invalidated for a particular page/file?
>>>>>>>>> Any other ideas?
>>>>>>>>> 
>>>>>>>>> Failing a customization of the Fastly setup, and maybe also for the
>>>>>>>>> short term, is/could there be a special location provided by
>>>>>>>>> pypi.python.org which the above tools could use to get at the actual
>>>>>>>>> non-cached data?  We could then maybe mitigate the problem through
>>>>>>>>> updates of the respective tools.
>>>>>>>>> That would at least solve the problem for one of my customers i think.
>>>>>>>>> 
>>>>>>>>> best,
>>>>>>>>> holger
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sun, May 26, 2013 at 10:34 -0700, Noah Kantrowitz wrote:
>>>>>>>>>> </farnsworth>
>>>>>>>>>> 
>>>>>>>>>> but seriously, at long last today it was my honor to throw the DNS 
>>>>>>>>>> switch to move PyPI to the Fastly caching CDN. I would like to thank 
>>>>>>>>>> Donald Stufft for doing much of the heavy lifting on the PyPI side, 
>>>>>>>>>> and to Fastly for graciously offering to host us. What does this 
>>>>>>>>>> mean for everyone? Well the biggest change is PyPI should get a 
>>>>>>>>>> whole lot faster. There are two major downsides however. There will 
>>>>>>>>>> now be a delay of several minutes in some cases between updating a 
>>>>>>>>>> package and having it be installable, and download counts will now 
>>>>>>>>>> be even more incorrect than they were before. The PyPI admins are 
>>>>>>>>>> discussing what to do about download counts long-term, but for now 
>>>>>>>>>> we all feel that the performance and availability benefits outweigh 
>>>>>>>>>> the loss. If anyone has any questions, or hears anything about 
>>>>>>>>>> issues with PyPI please don't hesitate to contact me.
>>>>>>>>>> 
>>>>>>>>>> --Noah
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Distutils-SIG maillist  -  Distutils-SIG@python.org
>>>>>>>>>> http://mail.python.org/mailman/listinfo/distutils-sig
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> I mentioned it on twitter but might as well mention it here as well.
>>>>>>>> 
>>>>>>>> Currently there is no invalidation going on. The effect on the 
>>>>>>>> mirroring was unanticipated and I'm currently getting the invalidation 
>>>>>>>> API setup within PyPI.
>>>>>>>> 
>>>>>>>> -----------------
>>>>>>>> Donald Stufft
>>>>>>>> PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> /simple/ Pages should now be immediately invalidated when a new package 
>>>>>>> is released.
>>>>>> 
>>>>>> thanks Donald.  Looking at the implementation, i wonder what happens if
>>>>>> after ``self._conn.commit()`` a changelog API call arrives, returns
>>>>>> changes and a client uses it to retrieve changes before the
>>>>>> fastly-purging takes place.  It's still a potential race-condition or
>>>>>> am i missing something?
>>>>>> 
>>>>>> best,
>>>>>> holger
>>>>>> 
>>>>>>> -----------------
>>>>>>> Donald Stufft
>>>>>>> PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> There's no way around a race condition.
>>>>> 
>>>>> ``self._conn.commit()`` is what makes the changes available. If we purge
>>>>> prior to committing, then anyone who hits the page between the purge and
>>>>> ``self._conn.commit()`` will see (and re-cache) the page as it was prior
>>>>> to the update, while the change log will appear to be updated.
>>>>> Essentially the same problem we have now.
>>>>> 
>>>>> The current implementation does mean that if a client happens to hit
>>>>> between the commit and the purge they'll see old data; however, that's
>>>>> pretty unlikely.
>>>> 
>>>> Purging can take a second and also depends on the network connectivity 
>>>> between pypi.python.org and fastly's api to begin with.   I am afraid 
>>>> the race-condition is bound to happen and then hard to detect.  
>>>> 
>>>> Not sure how exactly pypi.python.org is deployed but could commit() use
>>>> a semaphore which the changelog APIs also use, so that the latter only
>>>> return after purging (and then some) has happened?  I don't think
>>>> mirrors would mind sometimes waiting a few seconds before the changelog*
>>>> call returns as long as the state is then consistent.
>>>> 
>>>> Lastly, i think introducing a bit of internal syncing overhead to
>>>> commit()/changelog should be ok because we have only few writes and
>>>> hardly any read load.
>>> 
>>> Mirroring should not be affected by caching at all, as new packages mean 
>>> new URLs (/pypi/name/version), so when you retrieve them there will be no 
>>> cache issues. 
>> 
>> The simple/PROJ pages are changed, not newly created.  (and yes,
>> new release files are not so much the problem because they are new
>> and thus retrieved from fastly on first access).
> 
> Yes, pep381client is fundamentally incompatible with the future of PyPI's 
> infrastructure. Sorry, this will not be changed at this point. If people 
> would like to continue to operate mirrors, they will need to transition to 
> use the API to access package information, fetch updated files, and rebuild 
> any relevant index data. For example, this is how Donald's crate.io mirror 
> operates. Using the current strategy of scraping the simple/ pages will 
> continue to work; you just need to retry failed requests until they succeed 
> (and check that the per-project pages match the version you expect from the 
> change log, and consider it a failure if they do not). This is just a stopgap 
> though, and should not be considered a long-term solution.
> 
>> 
>>> What I think you mean is this makes a race condition for pep381client,
>>> however this is a bug in pep381client, not PyPI. If you would like to
>>> submit a patch for a Paxos-based replication protocol, I'm sure Donald
>>> and I would be happy to review it.
>> 
>> I am a bit lost as to what you are talking about here.
>> 
>> The move to CDN broke things that worked before.  The changelog API
>> reported changes that could not be seen afterwards.  This remains true
>> after Donald's changes which just make it less likely but not impossible
>> to happen.
> 
> They worked effectively by accident, not because it was correct. Had I 
> understood how backwards the pep381 systems are, I would have alerted you all 
> sooner, I apologize for this lapse. I am happy to talk about how to correctly 
> use the PyPI API with anyone that has questions, or discuss more advanced 
> replication options that will be free of race conditions in a distributed 
> (yes, PyPI is a distributed database now) environment.
> 
> --Noah
> 

Just to assure folks. I do consider Mirroring a first class citizen and an 
important feature.

Better support will be coming for it. However, this relatively minor issue, 
which requires clients to get a little smarter to use the current system, is 
a regrettable cost of making installation from PyPI not suck for a large 
swathe of people.
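
As a rough illustration of the "retry until the per-project page matches the
change log" stopgap described in the quoted message, a mirror client could do
something like the sketch below. It is not taken from bandersnatch,
pep381client or devpi-server. The names fetch_until_consistent, fetch_page and
StaleIndexError are made up for this example, and the "name-version" substring
check is an assumed, deliberately crude stand-in for parsing the page's links.

```python
import time


class StaleIndexError(Exception):
    """Raised when the index page never showed the expected release."""


def fetch_until_consistent(fetch_page, project, version,
                           attempts=5, delay=2.0, sleep=time.sleep):
    """Fetch a project's simple page until it lists `version`.

    fetch_page is a caller-supplied (hypothetical) function that takes a
    project name and returns the page body as text, raising OSError when
    the request fails.
    """
    # Assumption: release files are named "<project>-<version>.<ext>".
    # A real client should parse the links instead of substring matching.
    marker = "%s-%s" % (project, version)
    for attempt in range(attempts):
        try:
            body = fetch_page(project)
        except OSError:
            body = None                  # failed request: treat as stale
        if body is not None and marker in body:
            return body                  # page agrees with the change log
        if attempt < attempts - 1:
            sleep(delay)                 # give the cache time to catch up
    raise StaleIndexError("%s %s not visible after %d attempts"
                          % (project, version, attempts))
```

A mirror would call this once per project reported by the change log, passing
the version the change log announced, and would only record the change as
processed after the call returns without raising.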

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
