On 4/17/12 11:57 AM, mar...@v.loewis.de wrote:
> by calculating the grand hash of each file hash.
In this case, the checksum would not be a reliable indication that the
files are actually up-to-date. For example, a mirror may keep updating
files into the wrong location (not the location that is then used to
serve the files), so that the files being served are from a stale copy.
This is not theoretical - it actually happened in my mirror setup at one
time.
So you were updating one directory but serving another?
But then updating the right last-modified page that people were seeing?
In that case, updating the checksum would have revealed that you were on
the wrong set of files.
Unless your script was updating everything on a stale copy that was not
published?
That could take a few hours per change.
Why is that? You don't calculate the checksum of a file you already
have twice.
Even if you do, calling md5 is very fast.
Try it:
$ find mirror | xargs md5
This takes a few seconds at most on the whole mirror.
I tried it, and on my mirror, it took 27 minutes and 7 seconds.
So not exactly hours, but not "a few seconds" either.
Oops, sorry, I ran it on the wrong directory. It's true that it takes
more time!
So on my CentOS 5 VM, which is quite slow and busy with many other things
like running Jenkins jobs, I ran the "md5deep" program like this:
http://tarek.pastebin.mozilla.org/1574557
It took 15 minutes and 1 second. It can be optimized of course, since
most directories are done quickly and the bulk is in /source. That
time can be cut at least in half with proper load balancing between
a few md5 runners.
But that only has to be run *once*. You would not recompute it on every
mirror update but keep all the md5 values somewhere.
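
To illustrate "keep all md5 values somewhere", here is a minimal sketch of
an incremental cache. It is not the pastebin script; the function names
(file_md5, update_md5_cache), the cache layout, and the use of size and
mtime to detect changed files are all assumptions made for the example.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_md5_cache(root, cache):
    """Hash only files that are new or whose size/mtime changed.

    `cache` maps a path relative to `root` to a dict with the keys
    "size", "mtime" and "md5" (a hypothetical layout). Entries for
    files that disappeared are dropped. Returns the sorted list of
    paths that were (re)hashed.
    """
    rehashed = []
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            seen.add(rel)
            stat = os.stat(full)
            entry = cache.get(rel)
            unchanged = (
                entry is not None
                and entry["size"] == stat.st_size
                and entry["mtime"] == stat.st_mtime
            )
            if unchanged:
                continue  # already hashed, skip the expensive read
            cache[rel] = {
                "size": stat.st_size,
                "mtime": stat.st_mtime,
                "md5": file_md5(full),
            }
            rehashed.append(rel)
    # Forget files that are no longer on disk.
    for rel in set(cache) - seen:
        del cache[rel]
    return sorted(rehashed)
```

The first run pays the full cost; later runs only read files that are new
or changed. The cache is a plain dict here and could be persisted in
whatever form is convenient.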
So recalculating the grand hash on every mirror update should take a
few seconds, because it would just consist of calculating the hash for
the new files, then calculating the grand hash. A loop that updates an
md5 hash with 20k hashes takes less than a second, not counting the
file reading
(see http://tarek.pastebin.mozilla.org/1574574).
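
The grand-hash loop could look roughly like the sketch below. It is not
the pastebin script: the name grand_hash, the dict of path to hex digest,
and the choice to sort by path and mix the path into the hash are
assumptions made for the example.

```python
import hashlib


def grand_hash(file_hashes):
    """Combine per-file md5 hex digests into a single md5 digest.

    `file_hashes` maps a file path to its md5 hex digest. Paths are
    sorted so that every mirror feeds the digests in the same order,
    and the path itself is hashed too (an assumption) so that a
    renamed file changes the result.
    """
    grand = hashlib.md5()
    for path in sorted(file_hashes):
        grand.update(path.encode("utf-8"))
        grand.update(file_hashes[path].encode("ascii"))
    return grand.hexdigest()
```

Two mirrors with the same files and the same per-file digests get the
same grand hash regardless of the order in which the files were scanned.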
I am not sure why we're having this discussion, since these are
implementation details, but it's fun :)
If there's interest, I can write a multiprocess-based script that keeps
an md5 database up to date.
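
As a rough idea of what such a script's hashing step might look like,
here is a sketch using the standard multiprocessing module. The names
md5_of and hash_tree, the default of 4 workers, and the chunk sizes are
assumptions made for the example, and it only computes the digests; it
does not maintain a database.

```python
import hashlib
import os
from multiprocessing import Pool


def md5_of(path):
    """Return (path, hex md5 digest) for one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def hash_tree(root, workers=4):
    """Hash every file under `root`, spreading files over `workers`.

    Returns a dict mapping each file path to its md5 hex digest.
    With workers <= 1 the files are hashed serially in this process.
    """
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    ]
    if workers <= 1:
        return dict(map(md5_of, paths))
    with Pool(processes=workers) as pool:
        # imap_unordered hands out files as workers become free, which
        # balances the load when a few directories hold most of the data.
        return dict(pool.imap_unordered(md5_of, paths, chunksize=64))
```

On platforms where multiprocessing spawns fresh interpreters, hash_tree
should be called from under an `if __name__ == "__main__":` guard.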
Cheers
Tarek
Regards,
Martin
_______________________________________________
Catalog-SIG mailing list
Catalog-SIG@python.org
http://mail.python.org/mailman/listinfo/catalog-sig