On 2018-09-26 14:56:16, Daniel Lange wrote: [...]
> In any case, a repo with just the split files but no maintained history clones
> in ~12s in the above test setup. It also brings the (bare) repo down from 3,3GB
> to 189MB. So the issue is really the data/CVE/list file.

So I've looked into that problem as well, four months ago:

https://salsa.debian.org/security-tracker-team/security-tracker/issues/2

In there I proposed splitting the data/CVE/list file into "one file per
CVE". In retrospect, that was a rather naive approach and caused all
sorts of problems: there were so many files that even the shell choked
(argument list too long).

I hadn't thought of splitting things into "one file per *year*". That
could really help! Unfortunately, it's hard to simulate what such a
repository would look like *14 years* from now (yes, that's how old
this repo already is). I can think of two ways to simulate that:

 1. generate commits to recreate all files from scratch: parse
    data/CVE/list, split it up into yearly chunks, and add each CVE in
    one separate commit (see the sketch at the end of this message).
    It's not *exactly* how things are done now, but it should be a
    close enough approximation.

 2. do a crazy filter-branch to send commits to the right files.
    Considering how long an initial clone takes, I can't even begin to
    imagine how long *that* would take, but it would be the most
    accurate simulation.

Short of that, I think it's somewhat dishonest to compare a clean
repository with split files against a repository with 14 years of
history and thousands of commits.

Intuitively, I think you're right and that "sharding" the data into
yearly chunks would help git's performance a lot. But we won't know
until we simulate it, and if we hit that problem again 5 years from
now, all that work will have been for nothing. (Although it *would*
give us 5 years...)

> That said, data/DSA/list is 14575 lines. That seems to not bother git too much
> yet. Still if things get re-structured, this file may be worth a look, too.

Yeah, I haven't had trouble with that one yet either.

> To me the most reasonable path forward unfortunately looks like start a new repo
> for 2019+ and "just" import the split files or single-record files as mentioned
> by pabs but not the git/svn/cvs history. The old repo would - of course - stay
> around but frozen at a deadline.

In any case, I personally don't think history over those files is that
critical. We rarely dig into that history because it's so expensive:
any "git annotate" takes forever in this repo, and running it over
data/CVE/list in particular takes tens of minutes.

That said, once we pick a solution, we *could* craft a magic
filter-branch that *would* keep history. It might be worth eating that
performance cost then. I'll run some tests to see if I can make sense
of such a filter.

> Corsac also mentioned on IRC that the repo could be hosted outside of Gitlab.
> That would reduce the pressure for some time.
> But cgit and other git frontends (as well as backends) we tested also struggle
> with the repo (which is why my company, Faster IT GmbH, used the security-tracker
> repo as a very welcome test case in the first place).
> So that would buy time but not be a solution long(er) term.

Agreed. I think the benefits of hosting on GitLab outweigh the trouble
of re-architecting our data store. As I said, it's not just GitLab that
struggles with a 17MB text file: git itself has trouble dealing with it
as well, and I am often frustrated by that in my own work...

A.
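
PS: for the record, here is a rough, untested sketch of what the "split
by year" pass in option 1 above could look like. The output file names
(data/CVE/list.<year>) are just a placeholder, not an agreed-upon
layout, and the parsing assumes each entry starts with its CVE
identifier in the first column, with continuation lines indented:

    #!/usr/bin/env python3
    # Rough sketch (untested): split data/CVE/list into one chunk per
    # year, keyed on the year embedded in the CVE identifier.
    # Output names like data/CVE/list.2018 are placeholders only.
    import re
    from collections import defaultdict

    chunks = defaultdict(list)
    year = 'unknown'  # bucket for anything before the first CVE header
    with open('data/CVE/list', encoding='utf-8') as listfile:
        for line in listfile:
            # a new entry starts with "CVE-YYYY-..." at column 0;
            # continuation lines (packages, NOTEs, TODOs) are indented
            match = re.match(r'CVE-(\d{4})-', line)
            if match:
                year = match.group(1)
            chunks[year].append(line)

    for year, lines in sorted(chunks.items()):
        with open('data/CVE/list.%s' % year, 'w', encoding='utf-8') as out:
            out.writelines(lines)

In principle the same script could also be run at every revision
through something like "git filter-branch --tree-filter" to get the
history-preserving variant (option 2), but given the clone times above
I haven't tried that on the full repository.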
--
You are absolutely deluded, if not stupid, if you think that a
worldwide collection of software engineers who can't write operating
systems or applications without security holes, can then turn around
and suddenly write virtualization layers without security holes.
                        - Theo de Raadt