On 2018-11-09 16:05:06, Antoine Beaupré wrote:
> 2. do a crazy filter-branch to send commits to the right
>    files. considering how long an initial clone takes, i can't even
>    begin to imagine how long *that* would take. but it would be the
>    most accurate simulation.
>
> Short of that, I think it's somewhat dishonest to compare a clean
> repository with split files against a repository with history over 14
> years and thousands of commits. Intuitively, I think you're right and
> that "sharding" the data in yearly packets would help git's
> performance a lot. But we won't know until we simulate it, and if we
> hit that problem again 5 years from now, all that work will have been
> for nothing. (Although it *would* give us 5 years...)
So I've done that craaaazy filter-branch, on a shallow clone (1000
commits). The original clone is about 30MB, but the split repo is only
4MB.

Cloning the original repo takes a solid 30+ seconds:

[1221]anarcat@curie:src130$ time git clone file://$PWD/security-tracker-1000.orig security-tracker-1000.orig-test
Cloning into 'security-tracker-1000.orig-test'...
remote: Enumerating objects: 5291, done.
remote: Counting objects: 100% (5291/5291), done.
remote: Compressing objects: 100% (1264/1264), done.
remote: Total 5291 (delta 3157), reused 5291 (delta 3157)
Receiving objects: 100% (5291/5291), 8.80 MiB | 19.47 MiB/s, done.
Resolving deltas: 100% (3157/3157), done.
64.35user 0.44system 0:34.32elapsed 188%CPU (0avgtext+0avgdata 200056maxresident)k
0inputs+58968outputs (0major+48449minor)pagefaults 0swaps

Cloning the split repo takes less than a second:

[1223]anarcat@curie:src$ time git clone file://$PWD/security-tracker-1000-filtered security-tracker-1000-filtered-test
Cloning into 'security-tracker-1000-filtered-test'...
remote: Enumerating objects: 2214, done.
remote: Counting objects: 100% (2214/2214), done.
remote: Compressing objects: 100% (1190/1190), done.
remote: Total 2214 (delta 936), reused 2214 (delta 936)
Receiving objects: 100% (2214/2214), 1.25 MiB | 22.78 MiB/s, done.
Resolving deltas: 100% (936/936), done.
0.25user 0.04system 0:00.38elapsed 79%CPU (0avgtext+0avgdata 8200maxresident)k
0inputs+8664outputs (0major+3678minor)pagefaults 0swaps

So this is clearly a win, and I think it would be possible to rewrite
the history using the filter-branch command. Commit IDs would change,
but we would keep all commits, so annotate and all that good stuff
would still work.

The split-by-year bash script was too slow for my purposes: it took a
solid 15 seconds for each run, which meant it would have taken 9
*days* to process the entire repository. So I tried to see whether it
could be optimized, to let us split the file while keeping history
without having to shut down the whole system for days. I first
rewrote it in Python, which processed the 1000 commits in 801
seconds. That gives an estimate of 15 hours for the 68278 commits I
had locally. Concerned about the Python startup time, I then tried
Go, which processed the tree in 262 seconds, giving a final estimate
of 4.8 hours.

Attached are both implementations, for those who want to reproduce my
results. Note that they differ from the original implementation in
that they (naturally) have to remove the data/CVE/list file itself,
otherwise it is kept in history. Here's how to call it:

git -c commit.gpgSign=false filter-branch --tree-filter '/home/anarcat/src/security-tracker/bin/split-by-year.py data/CVE/list' HEAD

Also observe how all gpg commit signatures are (obviously) lost. I
have explicitly disabled signing because those signatures actually
take a long time to compute...

I haven't tested whether a graft would improve performance, but I
suspect it would not, given the sheer size of the repository that
would effectively need to be carried over anyway.

A.

-- 
Man really attains the state of complete humanity when he produces,
without being forced by physical need to sell himself as a commodity.
                        - Ernesto "Che" Guevara
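For anyone trying to reproduce the test setup: the 1000-commit
shallow clone can be created with git's --depth option. A minimal
sketch, assuming the canonical salsa remote (any security-tracker
remote will do):

git clone --depth 1000 https://salsa.debian.org/security-tracker-team/security-tracker.git security-tracker-1000.orig

The filtered repository is then produced by running the filter-branch
command above in a copy of that clone.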
package main

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	// The monolithic file to split. Every entry is assumed to start
	// with a "CVE-YYYY-..." header line followed by at least one
	// indented detail line.
	file, err := os.Open("data/CVE/list")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()
	var (
		line     []byte
		cve      []byte
		year     uint64
		year_str string
		header   bool
	)
	// One open file descriptor per year, reused across entries.
	fds := make(map[uint64]*os.File, 20)
	reader := bufio.NewReader(file)
	for {
		line, err = reader.ReadBytes('\n')
		if bytes.HasPrefix(line, []byte("CVE-")) {
			// Remember the header; it is written out when the first
			// detail line tells us which yearly file it belongs to.
			cve = line
			year_str = strings.Split(string(line), "-")[1]
			year, _ = strconv.ParseUint(year_str, 10, 0)
			header = true
		} else {
			target, ok := fds[year]
			if !ok {
				target, err = os.OpenFile("data/CVE/list."+year_str,
					os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
				if err != nil {
					log.Fatal(err)
				}
				fds[year] = target
			}
			if header {
				target.Write(cve)
				header = false
			}
			target.Write(line)
		}
		if err != nil {
			break
		}
	}
	if err != io.EOF {
		log.Fatal(err)
	}
	// The original file must go away, otherwise filter-branch keeps
	// it in the rewritten history.
	os.Remove("data/CVE/list")
}
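For the Go version, here's a sketch of how it would be wired into the
same filter-branch run; the source file name and binary path are
examples and need adjusting. The point is to build once, outside the
tree-filter, so the per-commit cost is a single exec rather than an
interpreter startup (which is exactly where Python lost time):

go build -o /home/anarcat/bin/split-by-year split-by-year.go
git -c commit.gpgSign=false filter-branch --tree-filter '/home/anarcat/bin/split-by-year' HEAD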
#!/usr/bin/python3

import os
import sys

# The file to split can be given as the first argument, as in the
# filter-branch example above; it defaults to the canonical location.
data = sys.argv[1] if len(sys.argv) > 1 else 'data/CVE/list'

fds = {}
cve = None
with open(data) as source:
    for line in source:
        if line.startswith('CVE-'):
            # Remember the header; it is written out when the first
            # detail line tells us which yearly file it belongs to.
            cve = line
            year = int(line.split('-')[1])
        else:
            # One open file descriptor per year, reused across entries.
            target = fds.get(year)
            if target is None:
                yearly = '{}.{:d}'.format(data, year)
                fds[year] = target = open(yearly, 'a')
            if cve:
                target.write(cve)
                cve = None
            target.write(line)
for fd in fds.values():
    fd.close()
# The original file must be removed, otherwise filter-branch keeps it
# in the rewritten history.
os.unlink(data)
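To sanity-check a single run before committing to the full rewrite,
one idea is to compare the sorted line sets before and after the
split: a straight diff of the concatenated yearly files would be too
strict, since entries are not perfectly grouped by year in the
original, but no line should be lost or duplicated. A sketch,
assuming the script path from the filter-branch example above:

cp data/CVE/list /tmp/list.orig
bin/split-by-year.py data/CVE/list
diff <(sort /tmp/list.orig) <(cat data/CVE/list.[0-9]* | sort)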