On 2018-11-10 18:56:01, Daniel Lange wrote:
> Antoine,
>
> thank you very much for your filter-branch scripts.
you're welcome! glad it can be of use.
> I tested each:
>
> 1) the golang version:
> It completes after 3h36min:
>
> # git filter-branch --tree-filter '/split-by-year' HEAD
> Rewrite a09118bf0a33f3721c0b8f6880c4cbb1e407a39d (68282/68286) (12994 seconds
> passed, remaining 0 predicted)
> Ref 'refs/heads/master' was rewritten
>
> But it doesn't Close() the os.OpenFile handles so ...
> all data/CVE/list. files are 0 bytes long. Sic!
Well. That explains part of the performance difference. ;)
There were multiple problems with the golang source - variable shadowing
and, yes, a missing Close(). Surprisingly, the fixed version results is
*slower* than the equivalent Python code, taking about one second per
run or 1102 seconds for the last 1000 commits. I'm at a loss as to how I
managed to make go run slower than Python here (and can't help but think
C would have been easier, again). Probably poor programming on my
part. New version attached.
[...]
> 2.1) the Python version
> You claim #!/usr/bin/python3 in the shebang, so I tried that first:
>
> # git filter-branch --tree-filter '/usr/bin/python3
> /__pycache__/split-by-year.cpython-35.pyc' HEAD
> Rewrite 990d3c4bbb49308fb3de1e0e91b9ba5600386f8a (1220/68293) (41 seconds
> passed, remaining 2254 predicted)
> Traceback (most recent call last):
> File "split-by-year.py", line 13, in
> File "/usr/lib/python3.5/codecs.py", line 321, in decode
> (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 5463:
> invalid start byte
> tree filter failed: /usr/bin/python3 /__pycache__/split-by-year.cpython-35.pyc
I suspected this would be a problem, but didn't find any occurence in
the shallow clone so I forgot about it. Note that the golang version
takes great care to treat the data as binary...
> The offending commit is:
> * 990d3c4bbb - Rename sarge-checks data to something not specific to sarge,
> since we're working on etch now.
> Sorry for the probable annoyance, but it had to be done. (13 years ago)
> [Joey Hess]
>
> There will be many more like this, so for Python3
> this needs needs to be made unicode-agnostic.
... so I rewrote the thing to handle only binary and tested it against
that version of the file. It seems to work fine.
> Notice I compiled the .py to .pyc which makes it
> much faster and thus well usable.
Interesting. I didn't see much difference in performance in my
benchmarks on average, but the worst-case run did improve by 150ms, so I
guess this is worth the trouble. For those who didn't know (like me)
this means running:
python -m compileall bin/split-by-year.py
Whenever the .py file changes (right?).
> 2.2) Python, when a string was a string .. Python2
> Your code is actually Python2, so why not give that a try:
>
> # git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
> Rewrite b59da20b82011ffcfa6c4a453de9df58ee036b2c (2516/68293) (113 seconds
> passed, remaining 2954 predicted)
> Traceback (most recent call last):
> File "split-by-year.py", line 18, in
> yearly = 'data/CVE/list.{:d}'.format(year)
> NameError: name 'year' is not defined
> tree filter failed: /usr/bin/python2 /split-by-year.pyc
>
> The offending commit is:
> * b59da20b82 - claim (13 years ago) [Moritz Muehlenhoff]
> | diff --git a/data/CVE/list b/data/CVE/list
> | index 7b5d1d21d6..cdf0b74dd0 100644
> | --- a/data/CVE/list
> | +++ b/data/CVE/list
> | @@ -1,3 +1,4 @@
> | +begin claimed by jmm
> | CVE-2005-3276 (The sys_get_thread_area function in process.c in Linux 2.6
> before ...)
> | TODO: check
> | CVE-2005-3275 (The NAT code (1) ip_nat_proto_tcp.c and (2)
> ip_nat_proto_udp.c in ...)
> | @@ -34,6 +35,7 @@ CVE-2005-3260 (Multiple cross-site scripting (XSS)
> vulnerabilities in ...)
> | TODO: check
> | CVE-2005-3259 (Multiple SQL injection vulnerabilities in
> versatileBulletinBoard (vBB) ...)
> | TODO: check
> | +end claimed by jmm
> | CVE-2005- [Insecure caching of user id in mantis]
> | - mantis (bug #330682; unknown)
> | CVE-2005- [Filter information disclosure in mantis]
>
> As you see the line "+begin claimed by jmm" breaks the too simplistic parser
> logic.
> Unfortunately dry-running against a current version of data/CVE/list such
> errors do not show up.
> The "violations" of the file format are transient and buried in history.
Hmm... That's a trickier one. I guess we could just pretend that line
doesn't exist and drop it from history... But I chose to buffer it and
treat it like the CVE line so it gets attached to the right file. See if
it does what you expect.
git cat-file -p b59da20b82:data/CVE/list > data/CVE/list.b59da20b82
split-by-year.py data/CVE/list.b59da20b82
Performance-wise, I shaved off a surprising 60ms by enclosing all the
code in a function (yes