I recently released The BFG Repo-Cleaner, a new tool for cleansing bad
data out of Git repository histories. The BFG is typically at least
10-50x faster than git-filter-branch at these tasks:

* Removing Crazy Big Files from repo history
* Removing Passwords, Credentials & other Private data


As an example, these are timings for deleting an arbitrary file from
the large GCC repository (148495 commits):

The BFG : 3m29s
$ bfg -D README-fixinc

git filter-branch : 472m31s
$ git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch gcc/README-fixinc' \
    --prune-empty --tag-name-filter cat -- --all

(Roughly a 135x speed increase, reducing the task of processing a
large codebase from an overnight job to the work of a few minutes.
All timings were taken in a 4GB tmpfs ramdisk.)

The BFG has some simple but very powerful command-line options, which
perform at similar speed:

remove all blobs bigger than 1 megabyte :
$ bfg --strip-blobs-bigger-than 1M  my-repo.git

replace all passwords (listed in a file 'passwords.txt') with ***REMOVED*** :
$ bfg --replace-banned-strings passwords.txt  my-repo.git

The BFG's performance advantage comes mainly from avoiding repeated
examination of the same tree objects. git-filter-branch runs its
filter for each commit in turn, against the complete file hierarchy
of that commit, even though the trees of successive commits are
largely similar. For the use-cases of The BFG that's unnecessary:
we don't care where, or in which commit, a 'bad' file exists, we just
want it dealt with. Consequently the BFG processes the Git object db
on a memoised tree-by-tree basis, processing each file & folder
exactly once, after which the final processing of the commit
hierarchy is very quick. This _does_ mean that it's not possible to
delete files based on their absolute path within the repo, but they
can be deleted based on their filename, blob-id, or contents. This,
and multi-core processing by default, gives the dramatic speed-up
while still providing the same results.
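
As a rough illustration of the memoisation idea described above, here
is a toy model in Python. It is a sketch under stated assumptions, not
the BFG's actual code: the in-memory object store and the names put,
make_cleaner and clean_tree are all hypothetical, and it runs on a
single core. It shows why a subtree shared by many commits only needs
to be cleaned once, and why matching is by filename rather than by
absolute path.

```python
import hashlib

# Toy object store (an assumption for illustration, not Git's real format):
# id -> ("blob", bytes) or ("tree", tuple of (name, child_id) entries).
store = {}


def put(kind, payload):
    """Store an object under a content-derived id and return that id."""
    oid = hashlib.sha1(repr((kind, payload)).encode()).hexdigest()
    store[oid] = (kind, payload)
    return oid


def make_cleaner(banned_names):
    """Return a memoised tree cleaner plus the list of trees it really visited."""
    memo = {}    # original tree id -> cleaned tree id
    visits = []  # tree ids that were actually processed (not served from memo)

    def clean_tree(tree_id):
        if tree_id in memo:
            return memo[tree_id]  # identical tree seen before: reuse the result
        visits.append(tree_id)
        _, entries = store[tree_id]
        cleaned = []
        for name, child_id in entries:
            child_kind = store[child_id][0]
            if child_kind == "blob" and name in banned_names:
                continue  # drop the file by name, whatever folder it is in
            if child_kind == "tree":
                child_id = clean_tree(child_id)
            cleaned.append((name, child_id))
        new_id = put("tree", tuple(cleaned))
        memo[tree_id] = new_id
        return new_id

    return clean_tree, visits


# Two commits whose root trees differ only in ChangeLog and share the same
# "gcc" subtree, as successive commits in a real repository usually do.
readme_fixinc = put("blob", b"fixinc notes")
main_c = put("blob", b"int main(void) { return 0; }")
gcc_dir = put("tree", (("README-fixinc", readme_fixinc), ("main.c", main_c)))
root_1 = put("tree", (("ChangeLog", put("blob", b"v1")), ("gcc", gcc_dir)))
root_2 = put("tree", (("ChangeLog", put("blob", b"v2")), ("gcc", gcc_dir)))

clean_tree, visits = make_cleaner({"README-fixinc"})
cleaned_roots = [clean_tree(root) for root in (root_1, root_2)]

# Three trees are processed (root_1, gcc, root_2); the shared "gcc" subtree
# is cleaned once and served from the memo for the second commit.
print(len(visits))
```

A per-commit filter would walk the "gcc" folder once for every commit
that contains it; in this model the cost depends on the number of
distinct trees instead.
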
There's more performance data here:

I'd welcome feedback, and if anyone has cause to filter a repository's
history in future, I'd appreciate you giving the BFG a try and letting
me know how you found it.

Roberto Tyley
software dev @ The Guardian
