Antoine, thank you very much for your filter-branch scripts.
I tested each:

1) the golang version:

It completes after 3h36min:

# git filter-branch --tree-filter '/split-by-year' HEAD
Rewrite a09118bf0a33f3721c0b8f6880c4cbb1e407a39d (68282/68286) (12994 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten

But it doesn't Close() the os.OpenFile handles, so ... all data/CVE/list.yyyy files are 0 bytes long. Sic!

I can reproduce that just by running the golang executable against a current checkout of data/CVE/list.

# go version
go version go1.10.3 linux/amd64

(Stretch backport golang-go 2:1.10~5~bpo9+1)

2.1) the Python version

You claim #!/usr/bin/python3 in the shebang, so I tried that first:

# git filter-branch --tree-filter '/usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc' HEAD
Rewrite 990d3c4bbb49308fb3de1e0e91b9ba5600386f8a (1220/68293) (41 seconds passed, remaining 2254 predicted)
Traceback (most recent call last):
  File "split-by-year.py", line 13, in <module>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 5463: invalid start byte
tree filter failed: /usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc

The offending commit is:

* 990d3c4bbb - Rename sarge-checks data to something not specific to sarge, since we're working on etch now. Sorry for the probable annoyance, but it had to be done. (13 years ago) [Joey Hess]

There will be many more like this, so for Python3 this needs to be made unicode-agnostic.

Notice I compiled the .py to .pyc, which makes it much faster and thus well usable.

2.2) Python, when a string was a string .. Python2

Your code is actually Python2, so why not give that a try:

# git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
Rewrite b59da20b82011ffcfa6c4a453de9df58ee036b2c (2516/68293) (113 seconds passed, remaining 2954 predicted)
Traceback (most recent call last):
  File "split-by-year.py", line 18, in <module>
    yearly = 'data/CVE/list.{:d}'.format(year)
NameError: name 'year' is not defined
tree filter failed: /usr/bin/python2 /split-by-year.pyc

The offending commit is:

* b59da20b82 - claim (13 years ago) [Moritz Muehlenhoff]
| diff --git a/data/CVE/list b/data/CVE/list
| index 7b5d1d21d6..cdf0b74dd0 100644
| --- a/data/CVE/list
| +++ b/data/CVE/list
| @@ -1,3 +1,4 @@
| +begin claimed by jmm
|  CVE-2005-3276 (The sys_get_thread_area function in process.c in Linux 2.6 before ...)
|  	TODO: check
|  CVE-2005-3275 (The NAT code (1) ip_nat_proto_tcp.c and (2) ip_nat_proto_udp.c in ...)
| @@ -34,6 +35,7 @@ CVE-2005-3260 (Multiple cross-site scripting (XSS) vulnerabilities in ...)
|  	TODO: check
|  CVE-2005-3259 (Multiple SQL injection vulnerabilities in versatileBulletinBoard (vBB) ...)
|  	TODO: check
| +end claimed by jmm
|  CVE-2005-XXXX [Insecure caching of user id in mantis]
|  	- mantis <unfixed> (bug #330682; unknown)
|  CVE-2005-XXXX [Filter information disclosure in mantis]

As you can see, the line "+begin claimed by jmm" appears before any CVE entry, so no year has been parsed yet when the script reaches line 18, and the overly simplistic parser logic breaks.

Unfortunately, when dry-running against a current version of data/CVE/list, such errors do not show up. The "violations" of the file format are transient and buried in history.

Best,
Daniel
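
P.S. To illustrate the two Python failures above, here is a rough sketch of how a splitter could tolerate them. It is not your script, only an assumption about what might work: the file is handled as raw bytes (so a stray 0xf6 is never decoded), lines that come before the first CVE header (like "begin claimed by jmm") are attached to the year of the first header that follows, and a file with no header at all ends up under a hypothetical "list.unknown". The function names and that suffix are made up for illustration.

```python
import re
from collections import OrderedDict

# A CVE header line starts with "CVE-<4 digit year>-"; matched on bytes,
# so no decoding (and no UnicodeDecodeError) is involved.
HEADER = re.compile(rb"^CVE-(\d{4})-")


def split_by_year(data):
    """Group the raw lines of a data/CVE/list blob by entry year.

    Assumption: lines seen before the first CVE header are kept with
    the first header that follows them.
    """
    chunks = OrderedDict()  # year -> list of raw lines
    pending = []            # lines seen before any CVE header
    year = None
    for line in data.splitlines(keepends=True):
        match = HEADER.match(line)
        if match:
            year = int(match.group(1))
            chunks.setdefault(year, []).extend(pending)
            pending = []
        if year is None:
            pending.append(line)
        else:
            chunks[year].append(line)
    if pending:
        # No CVE header in the whole file: keep the lines under None.
        chunks.setdefault(None, []).extend(pending)
    return {y: b"".join(lines) for y, lines in chunks.items()}


def write_yearly(chunks, prefix="data/CVE/list"):
    """Write one file per year; "unknown" is a hypothetical fallback suffix."""
    for year, blob in chunks.items():
        suffix = "unknown" if year is None else "{:d}".format(year)
        # The with-block closes every handle, so nothing stays unflushed.
        with open("{}.{}".format(prefix, suffix), "wb") as fh:
            fh.write(blob)
```

In a tree filter this would boil down to reading data/CVE/list in "rb" mode, calling split_by_year() on the contents and passing the result to write_yearly(). Whether stray lines really belong with the following entry is a judgment call; I have not checked that against the whole history.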