On 2018-11-10 18:56:01, Daniel Lange wrote:
> Antoine,
>
> thank you very much for your filter-branch scripts.

you're welcome! glad it can be of use.

> I tested each:
>
> 1) the golang version:
> It completes after 3h36min:
>
> # git filter-branch --tree-filter '/split-by-year' HEAD
> Rewrite a09118bf0a33f3721c0b8f6880c4cbb1e407a39d (68282/68286) (12994 seconds 
> passed, remaining 0 predicted)
> Ref 'refs/heads/master' was rewritten
>
> But it doesn't Close() the os.OpenFile handles so ...
> all data/CVE/list.yyyy files are 0 bytes long. Sic!

Well. That explains part of the performance difference. ;)

There were multiple problems with the golang source - variable shadowing
and, yes, a missing Close(). Surprisingly, the fixed version results is
*slower* than the equivalent Python code, taking about one second per
run or 1102 seconds for the last 1000 commits. I'm at a loss as to how I
managed to make go run slower than Python here (and can't help but think
C would have been easier, again). Probably poor programming on my
part. New version attached.

[...]

> 2.1) the Python version
> You claim #!/usr/bin/python3 in the shebang, so I tried that first:
>
> # git filter-branch --tree-filter '/usr/bin/python3 
> /__pycache__/split-by-year.cpython-35.pyc' HEAD
> Rewrite 990d3c4bbb49308fb3de1e0e91b9ba5600386f8a (1220/68293) (41 seconds 
> passed, remaining 2254 predicted)
>   Traceback (most recent call last):
>   File "split-by-year.py", line 13, in <module>
>   File "/usr/lib/python3.5/codecs.py", line 321, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 5463: 
> invalid start byte
> tree filter failed: /usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc

I suspected this would be a problem, but didn't find any occurence in
the shallow clone so I forgot about it. Note that the golang version
takes great care to treat the data as binary...

> The offending commit is:
> * 990d3c4bbb - Rename sarge-checks data to something not specific to sarge, 
> since we're working on etch now.
>   Sorry for the probable annoyance, but it had to be done. (13 years ago) 
> [Joey Hess]
>
> There will be many more like this, so for Python3
> this needs needs to be made unicode-agnostic.

... so I rewrote the thing to handle only binary and tested it against
that version of the file. It seems to work fine.

> Notice I compiled the .py to .pyc which makes it
> much faster and thus well usable.

Interesting. I didn't see much difference in performance in my
benchmarks on average, but the worst-case run did improve by 150ms, so I
guess this is worth the trouble. For those who didn't know (like me)
this means running:

    python -m compileall bin/split-by-year.py

Whenever the .py file changes (right?).

> 2.2) Python, when a string was a string .. Python2
> Your code is actually Python2, so why not give that a try:
>
> # git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
> Rewrite b59da20b82011ffcfa6c4a453de9df58ee036b2c (2516/68293) (113 seconds 
> passed, remaining 2954 predicted)
>   Traceback (most recent call last):
>   File "split-by-year.py", line 18, in <module>
>     yearly = 'data/CVE/list.{:d}'.format(year)
> NameError: name 'year' is not defined
> tree filter failed: /usr/bin/python2 /split-by-year.pyc
>
> The offending commit is:
> * b59da20b82 - claim (13 years ago) [Moritz Muehlenhoff]
> | diff --git a/data/CVE/list b/data/CVE/list
> | index 7b5d1d21d6..cdf0b74dd0 100644
> | --- a/data/CVE/list
> | +++ b/data/CVE/list
> | @@ -1,3 +1,4 @@
> | +begin claimed by jmm
> |  CVE-2005-3276 (The sys_get_thread_area function in process.c in Linux 2.6 
> before ...)
> |       TODO: check
> |  CVE-2005-3275 (The NAT code (1) ip_nat_proto_tcp.c and (2) 
> ip_nat_proto_udp.c in ...)
> | @@ -34,6 +35,7 @@ CVE-2005-3260 (Multiple cross-site scripting (XSS) 
> vulnerabilities in ...)
> |       TODO: check
> |  CVE-2005-3259 (Multiple SQL injection vulnerabilities in 
> versatileBulletinBoard (vBB) ...)
> |       TODO: check
> | +end claimed by jmm
> |  CVE-2005-XXXX [Insecure caching of user id in mantis]
> |       - mantis <unfixed> (bug #330682; unknown)
> |  CVE-2005-XXXX [Filter information disclosure in mantis]
>
> As you see the line "+begin claimed by jmm" breaks the too simplistic parser 
> logic.
> Unfortunately dry-running against a current version of data/CVE/list such 
> errors do not show up.
> The "violations" of the file format are transient and buried in history.

Hmm... That's a trickier one. I guess we could just pretend that line
doesn't exist and drop it from history... But I chose to buffer it and
treat it like the CVE line so it gets attached to the right file. See if
it does what you expect.

   git cat-file -p b59da20b82:data/CVE/list > data/CVE/list.b59da20b82
   split-by-year.py data/CVE/list.b59da20b82

Performance-wise, I shaved off a surprising 60ms by enclosing all the
code in a function (yes, it's crazy), but the buffering to deal with the
above issue added another 40ms so performance should be similar.

I'll start a run on the whole history to see if I can find any problems,
as soon as a first clone finishes resolving those damn deltas. ;)

Thanks for the review!

A.

-- 
Premature optimization is the root of all evil
                        - Donald Knuth
#!/usr/bin/python3

import os
import sys


def main(path):
    fds = {}

    year = None
    buffer = b''
    with open(path, 'rb') as source:
        for line in source:
            if line.startswith(b'CVE-'):
                buffer += line
                year = line.split(b'-')[1]
                year = int(year.decode('ascii', errors='surrogateescape'))
            elif year:
                yearly = 'data/CVE/list.{:d}'.format(year)
                target = fds.get(year, None)
                if target is None:
                    fds[year] = target = open(yearly, 'ab')
                if buffer:
                    target.write(buffer)
                    buffer = b''
                target.write(line)
            else:
                buffer += line

    for year, fd in fds.items():
        fd.close()
    os.unlink(path)


if __name__ == '__main__':
    path = 'data/CVE/list'
    if len(sys.argv) > 1:
        path = sys.argv[1]
    main(path)
package main

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	file, err := os.Open("data/CVE/list")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	var (
		line     []byte
		cve      []byte
		year     uint64
		year_str string
		target   *os.File
		header   bool
		ok       bool
	)
	fds := make(map[uint64]*os.File, 20)
	scanner := bufio.NewReader(file)
	for {
		line, err = scanner.ReadBytes('\n')
		if err != nil {
			break
		}
		if bytes.HasPrefix(line, []byte("CVE-")) {
			cve = line
			year_str = strings.Split(string(line), "-")[1]
			year, _ = strconv.ParseUint(year_str, 0, 0)
			header = true
		} else {
			if target, ok = fds[year]; !ok {
				target, err = os.Create("data/CVE/list." + year_str)
				if err != nil {
					log.Fatal(err)
				}
				fds[year] = target
			}
			if header {
				_, err := target.Write(cve)
				if err != nil {
					log.Println("error writing", string(cve), target, err)
					break
				}
				header = false
			}
			_, err := target.Write(line)
			if err != nil {
				log.Println("error writing", string(line), target, err)
				break
			}
		}
	}
	if err != io.EOF {
		log.Fatal(err)
	}
	os.Remove("data/CVE/list")
}

Reply via email to