On 2018-09-26 14:56:16, Daniel Lange wrote: [...]
> In any case, a repo with just the split files but no maintained history clones
> in ~12s in the above test setup. It also brings the (bare) repo down from 3,3GB
> to 189MB. So the issue is really the data/CVE/list file.

So I've looked into that problem as well, four months ago:

https://salsa.debian.org/security-tracker-team/security-tracker/issues/2

In there I proposed splitting the data/CVE/list file into "one file per
CVE". In retrospect, that was a rather naive approach and caused all
sorts of problems: there were so many files that even the shell choked
(argument list too long).

I hadn't thought of splitting things into "one file per *year*". That
could really help! Unfortunately, it's hard to simulate what such a
repository would look like *14 years* from now (yes, that's how old
this repo already is). I can think of two ways to simulate that:

 1. generate commits to recreate all files from scratch: parse
    data/CVE/list, split it up into yearly chunks, and add each CVE in
    one separate commit (see the sketch at the end of this message).
    It's not *exactly* how things are done now, but it should be a
    close enough approximation.

 2. do a crazy filter-branch to send commits to the right files.
    Considering how long an initial clone takes, I can't even begin to
    imagine how long *that* would take, but it would be the most
    accurate simulation.

Short of that, I think it's somewhat dishonest to compare a clean
repository with split files against a repository with 14 years of
history and thousands of commits.

Intuitively, I think you're right and that "sharding" the data into
yearly chunks would help git's performance a lot. But we won't know
until we simulate it, and if we hit that problem again 5 years from
now, all that work will have been for nothing. (Although it *would*
give us 5 years...)

> That said, data/DSA/list is 14575 lines. That seems to not bother git too much
> yet. Still if things get re-structured, this file may be worth a look, too.

Yeah, I haven't had trouble with that one yet either.

> To me the most reasonable path forward unfortunately looks like start a new repo
> for 2019+ and "just" import the split files or single-record files as mentioned
> by pabs but not the git/svn/cvs history. The old repo would - of course - stay
> around but frozen at a deadline.

In any case, I personally don't think history over those files is that
critical. We rarely dig into that history because it's so expensive:
any "git annotate" takes forever in this repo, and running it over
data/CVE/list in particular takes tens of minutes.

That said, once we pick a solution, we *could* craft a magic
filter-branch that *would* keep history. It might be worth eating that
performance cost then. I'll run some tests to see if I can make sense
of such a filter.

> Corsac also mentioned on IRC that the repo could be hosted outside of Gitlab.
> That would reduce the pressure for some time.
> But cgit and other git frontends (as well as backends) we tested also struggle
> with the repo (which is why my company, Faster IT GmbH, used the security-tracker
> repo as a very welcome test case in the first place).
> So that would buy time but not be a solution long(er) term.

Agreed. I think the benefits of hosting on GitLab outweigh the trouble
of re-architecting our data store. As I said, it's not just GitLab that
struggles with a 17MB text file: git itself has trouble dealing with it
as well, and I am often frustrated by that in my own work...

A.
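
PS: for the record, here is a rough, untested sketch of what the "split
by year" pass in option 1 above could look like. The output file names
(data/CVE/list.<year>) are just a placeholder, not an agreed-upon
layout, and the parsing assumes each entry starts with its CVE
identifier in the first column, with continuation lines indented:

    #!/usr/bin/env python3
    # Rough sketch (untested): split data/CVE/list into one chunk per
    # year, keyed on the year embedded in the CVE identifier.
    # Output names like data/CVE/list.2018 are placeholders only.
    import re
    from collections import defaultdict

    chunks = defaultdict(list)
    year = 'unknown'  # bucket for anything before the first CVE header
    with open('data/CVE/list', encoding='utf-8') as listfile:
        for line in listfile:
            # a new entry starts with "CVE-YYYY-..." at column 0;
            # continuation lines (packages, NOTEs, TODOs) are indented
            match = re.match(r'CVE-(\d{4})-', line)
            if match:
                year = match.group(1)
            chunks[year].append(line)

    for year, lines in sorted(chunks.items()):
        with open('data/CVE/list.%s' % year, 'w', encoding='utf-8') as out:
            out.writelines(lines)

In principle the same script could also be run at every revision
through something like "git filter-branch --tree-filter" to get the
history-preserving variant (option 2), but given the clone times above
I haven't tried that on the full repository.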
--
You are absolutely deluded, if not stupid, if you think that a
worldwide collection of software engineers who can't write operating
systems or applications without security holes, can then turn around
and suddenly write virtualization layers without security holes.
                        - Theo de Raadt