Re: [Catalog-sig] PyPI mirrors are all up to date

Tarek Ziadé Tue, 17 Apr 2012 03:51:17 -0700

On 4/17/12 11:57 AM, mar...@v.loewis.de wrote:

> by calculating the grand hash of each file hash.


In this case, the checksum would not be a reliable indication that the
files are actually up-to-date. For example, a mirror may keep updating
files into the wrong location (not the location that is then used to
serve the files), so that the files being served are from a stale copy.
This is not theoretical - it actually happened in my mirror setup at one
time.

So you were updating one directory but serving another?

But then updating the right last-modified page people were seeing?

In that case, updating the checksum would have revealed you were on the wrong set of files.

Unless your script was updating everything on a stale copy that was not published?

That could take a few hours per change.

Why is that? You don't calculate the checksum of a file you already have twice.


Even if you do, it's very fast to call md5.

try it:

$ find mirror -type f | xargs md5

this takes a few seconds at most on the whole mirror


I tried it, and on my mirror, it took 27 minutes and 7 seconds.
So not exactly hours, but not "a few seconds" either.

Oops, sorry, I ran it on the wrong directory; it's true that it takes more time!

So on my CentOS 5 VM, which is quite slow and is doing many other things like running Jenkins jobs, I ran the "md5deep" program like this: http://tarek.pastebin.mozilla.org/1574557

It took 15 minutes and 1 second. It can be optimized of course, since most directories are done quickly and everything is in /source. That time can be divided by 2 at least with proper load balancing between a few md5 runners.

But that just has to be run *once*. You would not compute it on every mirror update but keep all md5 values somewhere.

So, recalculating the grand hash on every mirror update should take a few seconds, because it would just consist of calculating the hash for the new files, then calculating the grand hash. A loop that updates an md5 hash with 20k hashes takes less than a second if I don't count the file reading.


(see http://tarek.pastebin.mozilla.org/1574574)
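
For illustration only, here is a minimal sketch of the incremental approach described above. It is not the script from the pastebin links. The names file_md5, update_cache and grand_hash are hypothetical, as is the choice to detect changed files by size and mtime and to feed the per-file hashes into the grand hash in sorted path order.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_cache(root, cache):
    """Hash only files that are new or whose size/mtime changed.

    `cache` maps relative path -> [size, mtime, md5] (assumed layout).
    Returns the number of files that actually had to be re-read.
    """
    seen = set()
    rehashed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            stat = os.stat(path)
            seen.add(rel)
            entry = cache.get(rel)
            if entry is None or entry[0] != stat.st_size or entry[1] != stat.st_mtime:
                cache[rel] = [stat.st_size, stat.st_mtime, file_md5(path)]
                rehashed += 1
    # Forget files that have been removed from the mirror.
    for rel in set(cache) - seen:
        del cache[rel]
    return rehashed


def grand_hash(cache):
    """Combine the stored per-file hashes into one digest, in sorted path order."""
    grand = hashlib.md5()
    for rel in sorted(cache):
        grand.update(cache[rel][2].encode("ascii"))
    return grand.hexdigest()
```

The first call to update_cache reads every file; later calls only read what changed, and grand_hash never touches the disk. The cache dict would have to be stored somewhere between runs, which this sketch leaves out.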

I am not sure why we're having this discussion since it's implementation details, but it's fun :)

If there's interest I can write a multiprocess-based script that keeps an md5 database up to date
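
A rough sketch of what the parallel part of such a script might look like. This is an assumption, not the script offered above: it swaps in a thread pool instead of multiple processes (hashlib releases the GIL while hashing larger buffers, so threads can still overlap), and the names md5_of, hash_tree and save_database, as well as the JSON file format, are hypothetical.

```python
import hashlib
import json
import os
from concurrent.futures import ThreadPoolExecutor


def md5_of(path):
    """Return (path, hex MD5) for one file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def hash_tree(root, workers=4):
    """Hash every file under root with a pool of worker threads."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(md5_of, paths))


def save_database(database, filename):
    """Write the path -> md5 mapping to a JSON file (assumed format)."""
    with open(filename, "w", encoding="utf-8") as handle:
        json.dump(database, handle, indent=1, sort_keys=True)
```

Something like save_database(hash_tree("mirror"), "md5db.json") would do the one-off full run; whether the workers actually help depends on how much of the time is spent waiting on the disk.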


Cheers
Tarek


Regards,
Martin


_______________________________________________
Catalog-SIG mailing list
Catalog-SIG@python.org
http://mail.python.org/mailman/listinfo/catalog-sig
