On 2018-11-10 18:56:01, Daniel Lange wrote: > Antoine, > > thank you very much for your filter-branch scripts.
you're welcome! glad it can be of use. > I tested each: > > 1) the golang version: > It completes after 3h36min: > > # git filter-branch --tree-filter '/split-by-year' HEAD > Rewrite a09118bf0a33f3721c0b8f6880c4cbb1e407a39d (68282/68286) (12994 seconds > passed, remaining 0 predicted) > Ref 'refs/heads/master' was rewritten > > But it doesn't Close() the os.OpenFile handles so ... > all data/CVE/list.yyyy files are 0 bytes long. Sic! Well. That explains part of the performance difference. ;) There were multiple problems with the golang source - variable shadowing and, yes, a missing Close(). Surprisingly, the fixed version results is *slower* than the equivalent Python code, taking about one second per run or 1102 seconds for the last 1000 commits. I'm at a loss as to how I managed to make go run slower than Python here (and can't help but think C would have been easier, again). Probably poor programming on my part. New version attached. [...] > 2.1) the Python version > You claim #!/usr/bin/python3 in the shebang, so I tried that first: > > # git filter-branch --tree-filter '/usr/bin/python3 > /__pycache__/split-by-year.cpython-35.pyc' HEAD > Rewrite 990d3c4bbb49308fb3de1e0e91b9ba5600386f8a (1220/68293) (41 seconds > passed, remaining 2254 predicted) > Traceback (most recent call last): > File "split-by-year.py", line 13, in <module> > File "/usr/lib/python3.5/codecs.py", line 321, in decode > (result, consumed) = self._buffer_decode(data, self.errors, final) > UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 5463: > invalid start byte > tree filter failed: /usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc I suspected this would be a problem, but didn't find any occurence in the shallow clone so I forgot about it. Note that the golang version takes great care to treat the data as binary... > The offending commit is: > * 990d3c4bbb - Rename sarge-checks data to something not specific to sarge, > since we're working on etch now. > Sorry for the probable annoyance, but it had to be done. (13 years ago) > [Joey Hess] > > There will be many more like this, so for Python3 > this needs needs to be made unicode-agnostic. ... so I rewrote the thing to handle only binary and tested it against that version of the file. It seems to work fine. > Notice I compiled the .py to .pyc which makes it > much faster and thus well usable. Interesting. I didn't see much difference in performance in my benchmarks on average, but the worst-case run did improve by 150ms, so I guess this is worth the trouble. For those who didn't know (like me) this means running: python -m compileall bin/split-by-year.py Whenever the .py file changes (right?). > 2.2) Python, when a string was a string .. Python2 > Your code is actually Python2, so why not give that a try: > > # git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD > Rewrite b59da20b82011ffcfa6c4a453de9df58ee036b2c (2516/68293) (113 seconds > passed, remaining 2954 predicted) > Traceback (most recent call last): > File "split-by-year.py", line 18, in <module> > yearly = 'data/CVE/list.{:d}'.format(year) > NameError: name 'year' is not defined > tree filter failed: /usr/bin/python2 /split-by-year.pyc > > The offending commit is: > * b59da20b82 - claim (13 years ago) [Moritz Muehlenhoff] > | diff --git a/data/CVE/list b/data/CVE/list > | index 7b5d1d21d6..cdf0b74dd0 100644 > | --- a/data/CVE/list > | +++ b/data/CVE/list > | @@ -1,3 +1,4 @@ > | +begin claimed by jmm > | CVE-2005-3276 (The sys_get_thread_area function in process.c in Linux 2.6 > before ...) > | TODO: check > | CVE-2005-3275 (The NAT code (1) ip_nat_proto_tcp.c and (2) > ip_nat_proto_udp.c in ...) > | @@ -34,6 +35,7 @@ CVE-2005-3260 (Multiple cross-site scripting (XSS) > vulnerabilities in ...) > | TODO: check > | CVE-2005-3259 (Multiple SQL injection vulnerabilities in > versatileBulletinBoard (vBB) ...) > | TODO: check > | +end claimed by jmm > | CVE-2005-XXXX [Insecure caching of user id in mantis] > | - mantis <unfixed> (bug #330682; unknown) > | CVE-2005-XXXX [Filter information disclosure in mantis] > > As you see the line "+begin claimed by jmm" breaks the too simplistic parser > logic. > Unfortunately dry-running against a current version of data/CVE/list such > errors do not show up. > The "violations" of the file format are transient and buried in history. Hmm... That's a trickier one. I guess we could just pretend that line doesn't exist and drop it from history... But I chose to buffer it and treat it like the CVE line so it gets attached to the right file. See if it does what you expect. git cat-file -p b59da20b82:data/CVE/list > data/CVE/list.b59da20b82 split-by-year.py data/CVE/list.b59da20b82 Performance-wise, I shaved off a surprising 60ms by enclosing all the code in a function (yes, it's crazy), but the buffering to deal with the above issue added another 40ms so performance should be similar. I'll start a run on the whole history to see if I can find any problems, as soon as a first clone finishes resolving those damn deltas. ;) Thanks for the review! A. -- Premature optimization is the root of all evil - Donald Knuth
#!/usr/bin/python3 import os import sys def main(path): fds = {} year = None buffer = b'' with open(path, 'rb') as source: for line in source: if line.startswith(b'CVE-'): buffer += line year = line.split(b'-')[1] year = int(year.decode('ascii', errors='surrogateescape')) elif year: yearly = 'data/CVE/list.{:d}'.format(year) target = fds.get(year, None) if target is None: fds[year] = target = open(yearly, 'ab') if buffer: target.write(buffer) buffer = b'' target.write(line) else: buffer += line for year, fd in fds.items(): fd.close() os.unlink(path) if __name__ == '__main__': path = 'data/CVE/list' if len(sys.argv) > 1: path = sys.argv[1] main(path)
package main import ( "bufio" "bytes" "io" "log" "os" "strconv" "strings" ) func main() { file, err := os.Open("data/CVE/list") if err != nil { log.Fatal(err) } defer file.Close() var ( line []byte cve []byte year uint64 year_str string target *os.File header bool ok bool ) fds := make(map[uint64]*os.File, 20) scanner := bufio.NewReader(file) for { line, err = scanner.ReadBytes('\n') if err != nil { break } if bytes.HasPrefix(line, []byte("CVE-")) { cve = line year_str = strings.Split(string(line), "-")[1] year, _ = strconv.ParseUint(year_str, 0, 0) header = true } else { if target, ok = fds[year]; !ok { target, err = os.Create("data/CVE/list." + year_str) if err != nil { log.Fatal(err) } fds[year] = target } if header { _, err := target.Write(cve) if err != nil { log.Println("error writing", string(cve), target, err) break } header = false } _, err := target.Write(line) if err != nil { log.Println("error writing", string(line), target, err) break } } } if err != io.EOF { log.Fatal(err) } os.Remove("data/CVE/list") }