On 4/17/12 11:57 AM, mar...@v.loewis.de wrote:
> by calculating the grand hash of each file hash.

In this case, the checksum would not be a reliable indication that the
files are actually up-to-date. For example, a mirror may keep updating
files into the wrong location (not the location that is then used to
serve the files), so that the files being served are from a stale copy.
This is not theoretical - it actually happened in my mirror setup at one
time.

So you were updating one directory but serving another?

But still updating the last-modified page that people were seeing?

In that case, the checksum would have revealed that you were serving the wrong set of files.

Unless your script was updating everything on a stale copy that was not published?


That could take a few hours per change.
Why is that? You don't calculate the checksum of a file you already have twice.

Even if you do, it's very fast to call md5.

try it:

$ find mirror -type f | xargs md5

this takes a few seconds at most on the whole mirror

I tried it, and on my mirror, it took 27 minutes and 7 seconds.
So not exactly hours, but not "a few seconds" either.
Oops, sorry, I ran it on the wrong directory. It's true that it takes more time!

So on my CentOS 5 VM, which is quite slow and busy with many other things such as running Jenkins jobs, I ran the "md5deep" program like this: http://tarek.pastebin.mozilla.org/1574557

It took 15 minutes and 1 second. It can be optimized, of course, since most directories are done quickly and almost everything is in /source. That time can be at least halved with proper load balancing between a few md5 runners.
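
To illustrate the "few md5 runners" idea, here is a minimal sketch, not the md5deep invocation from the pastebin. It assumes a pool of threads is good enough (hashlib releases the GIL while hashing large buffers); the function names and the default of 4 workers are made up for the example.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def md5_of_file(path, chunk_size=1 << 20):
    """Return (path, hex MD5 digest), reading the file in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def iter_files(root):
    """Yield the path of every regular file found below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def hash_tree(root, workers=4):
    """Hash all files below root with a small pool of runners.

    Each runner picks up the next file as soon as it is free, so one
    large directory does not leave the other runners idle.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(md5_of_file, iter_files(root)))


if __name__ == "__main__":
    # "mirror" is a placeholder path, as in the find command above.
    for path, md5 in sorted(hash_tree("mirror").items()):
        print(md5, path)
```

Whether this really halves the time depends on whether the run is bound by disk or by CPU; on a slow disk, more runners may not help.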

But that only needs to be run *once*. You would not recompute it on every mirror update, but keep all the md5 values somewhere.

So recalculating the grand hash on every mirror update should take a few seconds, because it would just consist of calculating the hash for the new files and then calculating the grand hash. A loop that updates an md5 hash with 20k hashes takes less than a second, not counting the file reading.

(see http://tarek.pastebin.mozilla.org/1574574)
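
The incremental scheme could look roughly like the sketch below. It is not the pastebin script. It assumes that a file whose size and mtime are unchanged does not need rehashing, and that the grand hash covers the relative path as well as the per-file md5, in sorted order; the cache layout and function names are invented for the example.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_cache(root, cache):
    """Bring the md5 cache in line with the files below root.

    Only files that are new, or whose size or mtime changed, are
    hashed again. Entries for files that disappeared are dropped.
    """
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            stat = os.stat(path)
            signature = [stat.st_size, stat.st_mtime_ns]
            entry = cache.get(rel)
            if entry is None or entry["sig"] != signature:
                cache[rel] = {"sig": signature, "md5": file_md5(path)}
            seen.add(rel)
    for rel in set(cache) - seen:
        del cache[rel]
    return cache


def grand_hash(cache):
    """Fold every cached per-file hash into a single MD5 digest."""
    grand = hashlib.md5()
    for rel in sorted(cache):
        grand.update(os.fsencode(rel))
        grand.update(cache[rel]["md5"].encode("ascii"))
    return grand.hexdigest()
```

The cache is a plain dict of lists and strings, so it could be kept between runs as a JSON file. After the first full pass, a mirror update only pays for reading the files that changed, plus the short loop in grand_hash.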

I am not sure why we're having this discussion, since these are implementation details, but it's fun :)

If there's interest, I can write a multiprocess-based script that keeps an md5 database up to date.

Cheers
Tarek


Regards,
Martin



_______________________________________________
Catalog-SIG mailing list
Catalog-SIG@python.org
http://mail.python.org/mailman/listinfo/catalog-sig
