Hello,

I'm attaching a script that is an initial attempt at doing this for the git side of things. Everything is done in bash at the moment. It does make use of GNU parallel (the "parallel" package in both Ubuntu and Debian); apart from that (and git itself), I don't think it uses anything that isn't a standard Linux tool.
Basically, to run it do the following:

1) Clone cpython.git into the current directory (i.e. try not to have any "generated" files lying around).
2) Put scan.sh in the current directory and run it there.

What it does is the following: it checks out every 1000th commit (this can be changed) going backwards on the current branch, computes the md5sum of every file (except those under .git), and puts the md5sums in a file in an outdir/ directory that it creates. The files are named $num-$commit, where $num is the number of commits _backwards_ from the current commit (which makes sense if you think about iterating backwards from the current commit).
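In simplified form, the idea is roughly the sketch below (this is not the attached scan.sh verbatim; the GNU parallel invocation and any error handling are left out):

    #!/bin/bash
    # Sketch only: walk backwards from HEAD, sampling every $step commits,
    # and record an md5sum listing of the working tree for each sample.
    step=1000
    branch=$(git rev-parse --abbrev-ref HEAD)   # so we can return here at the end
    mkdir -p outdir
    num=0
    git rev-list HEAD | while read -r commit; do
        if [ $((num % step)) -eq 0 ]; then
            git checkout -q "$commit"
            # hash every file except anything under .git/
            find . -path ./.git -prune -o -type f -print0 \
                | xargs -0 md5sum | sort -k 2 > "outdir/$num-$commit"
        fi
        num=$((num + 1))
    done
    git checkout -q "$branch"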
Running this on my laptop took ~11 minutes. I uploaded the output directory here in case you don't feel like running it:
http://thomasnyberg.com/outdir.tar.bz2

(Ignore the frontpage of my "website". I'm obviously not all that concerned by it...)
In any case, this might be helpful for others in addition to myself. I figured it was best to email the list before continuing (maybe this isn't really what's needed...). Possible things to add to this:
* doing something similar with the comments
* doing the same thing on all branches
* maybe only computing the md5sum for changed files (see the sketch right after this list)
* little thought has gone into efficiency... there may be obvious gains hiding
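For the changed-files idea, something like the following might work (an untested sketch; the function name is made up for illustration):

    # Untested sketch: re-hash only the files git reports as changed
    # between two sampled commits (function name is hypothetical).
    rehash_changed() {
        prev=$1
        cur=$2
        git checkout -q "$cur"
        git diff --name-only "$prev" "$cur" | while read -r f; do
            # files may have been deleted between the two commits
            [ -f "$f" ] && md5sum "$f"
        done
    }

The previous listing would then only need to be patched with these new sums rather than recomputed from scratch.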
Of course something similar would have to be run with the hg version and then a comparison would need to be done.
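Assuming the hg run writes an analogous directory (called outdir-hg/ here, purely hypothetically) using the same $num prefixes, the comparison could be as simple as:

    # Hypothetical comparison loop: match files by their numeric prefix,
    # since the commit-id suffix will obviously differ between git and hg.
    for g in outdir/*; do
        num=${g##*/}
        num=${num%%-*}
        h=$(ls outdir-hg/"$num"-* 2>/dev/null)
        if [ -z "$h" ] || ! diff -q "$g" "$h"; then
            echo "mismatch (or missing) for sample $num"
        fi
    done

(The paths inside the listings should line up as long as both scripts run from the repository root and skip the VCS metadata directories.)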
Hopefully this is helpful...

Cheers,
Thomas

On 05/08/2016 07:28 PM, Senthil Kumaran wrote:
> On Sun, May 8, 2016 at 4:12 PM, Émanuel Barry <vgr...@live.ca> wrote:
>> I understand that there's already a semi-official mirror of the cpython
>> repo on GitHub, and I've been wondering why it isn't enough for our needs.
>
> It is suitable for our needs. Our last discussion was about how we
> ascertain that the cpython git repo has the same history as the hg repo,
> so that after migrating we do not lose any information from the old
> system. This could be done by:
>
> * checking the number of commits in both repos for each branch
> * checking the hashes of the source files in the two repos
> * (and how do we go about validating each piece of the commit-log graph too?)
>
> If you have any suggestions, since you are using the cpython git mirror,
> please feel free to share your thoughts. Welcome to the party!
>
> Thanks,
> Senthil
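(A quick sketch of the commit-count check mentioned above; the repository paths are placeholders, and I'm assuming hg's "default" branch corresponds to "master" in the git mirror:)

    # Placeholder repo paths; each command counts all ancestors of the branch tip.
    git -C cpython-git rev-list --count master
    hg  -R cpython-hg  log -r 'ancestors(default)' --template 'x' | wc -c

If those numbers (and the per-branch equivalents) agree, that's at least a sanity check before diving into the file hashes.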
scan.sh
Description: application/shellscript