Re: [PATCH 00/23] Multi-pack-index (MIDX)
On 6/7/2018 10:45 AM, Ævar Arnfjörð Bjarmason wrote: On Thu, Jun 07 2018, Derrick Stolee wrote: To test the performance in this situation, I created a script that organizes the Linux repository in a similar fashion. I split the commit history into 50 parts by creating branches on every 10,000 commits of the first- parent history. Then, `git rev-list --objects A ^B` provides the list of objects reachable from A but not B, so I could send that to `git pack-objects` to create these "time-based" packfiles. With these 50 packfiles (deleting the old one from my fresh clone, and deleting all tags as they were no longer on-disk) I could then test 'git rev-list --objects HEAD^{tree}' and see: Before: 0.17s After: 0.13s % Diff: -23.5% By adding logic to count hits and misses to bsearch_pack, I was able to see that the command above calls that method 266,930 times with a hit rate of 33%. The MIDX has the same number of calls with a 100% hit rate. Do you have the script you used for this? It would be very interesting as something we could stick in t/perf/ to test this use-case in the future. How does this & the numbers below compare to just a naïve --max-pack-size= on linux.git? Is it possible for you to tar this test repo up and share it as a one-off? I've been polishing the core.validateAbbrev series I have, and it would be interesting to compare some of the (abbrev) numbers. Here is what I used. You will want to adjust your constants for whatever repo you are using. This is for the Linux kernel which has a first-parent history of ~50,000 commits. It also leaves a bunch of extra files around, so it is nowhere near incorporating into the code. #!/bin/bash for i in `seq 1 50` do ORDER=$((51 - $i)) NUM_BACK=$((1000 * ($i - 1))) echo creating batch/$ORDER git branch -f batch/$ORDER HEAD~$NUM_BACK echo batch/$ORDER git rev-parse batch/$ORDER done lastbranch="" for i in `seq 1 50` do branch=batch/$i if [$lastbranch -eq ""] then echo "$branch" git rev-list --objects $branch | sed 's/ .*//' >objects-$i.txt else echo "$lastbranch" echo "$branch" git rev-list --objects $branch ^$lastbranch | sed 's/ .*//' >objects-$i.txt fi git pack-objects --no-reuse-delta .git/objects/pack/branch-split2 lastbranch=$branch done for tag in `git tag --list` do git tag -d $tag done rm -rf .git/objects/pack/pack-* git midx write
Re: [PATCH 00/23] Multi-pack-index (MIDX)
On Thu, Jun 07 2018, Derrick Stolee wrote: > To test the performance in this situation, I created a > script that organizes the Linux repository in a similar > fashion. I split the commit history into 50 parts by > creating branches on every 10,000 commits of the first- > parent history. Then, `git rev-list --objects A ^B` > provides the list of objects reachable from A but not B, > so I could send that to `git pack-objects` to create > these "time-based" packfiles. With these 50 packfiles > (deleting the old one from my fresh clone, and deleting > all tags as they were no longer on-disk) I could then > test 'git rev-list --objects HEAD^{tree}' and see: > > Before: 0.17s > After: 0.13s > % Diff: -23.5% > > By adding logic to count hits and misses to bsearch_pack, > I was able to see that the command above calls that > method 266,930 times with a hit rate of 33%. The MIDX > has the same number of calls with a 100% hit rate. Do you have the script you used for this? It would be very interesting as something we could stick in t/perf/ to test this use-case in the future. How does this & the numbers below compare to just a naïve --max-pack-size= on linux.git? Is it possible for you to tar this test repo up and share it as a one-off? I've been polishing the core.validateAbbrev series I have, and it would be interesting to compare some of the (abbrev) numbers. > Abbreviation Speedups > - > > To fully disambiguate an abbreviation, we must iterate > through all packfiles to ensure no collision exists in > any packfile. This requires O(P log N) time. With the > MIDX, this is only O(log N) time. Our standard test [2] > is 'git log --oneline --parents --raw' because it writes > many abbreviations while also doing a lot of other work > (walking commits and trees to compute the raw diff). > > For a copy of the Linux repository with 50 packfiles > split by time, we observed the following: > > Before: 100.5 s > After: 58.2 s > % Diff: -59.7% > > > Request for Review Attention > > > I tried my best to take the feedback from the commit-graph > feature and apply it to this feature. I also worked to > follow the object-store refactoring as I could. I also have > some local commits that create a 'verify' subcommand and > integrate with 'fsck' similar to the commit-graph, but I'll > leave those for a later series (and review is still underway > for that part of the commit-graph). > > One place where I could use some guidance is related to the > current state of 'the_hash_algo' patches. The file format > allows a different "hash version" which then indicates the > length of the hash. What's the best way to ensure this > feature doesn't cause extra pain in the hash-agnostic series? > This will inform how I go back and make the commit-graph > feature better in this area, too. > > > Thanks, > -Stolee
Re: [PATCH 00/23] Multi-pack-index (MIDX)
On 6/7/2018 10:03 AM, Derrick Stolee wrote: This patch series includes a rewrite of the previous multi-pack-index RFC [1] using the feedback from the commit-graph feature. Sorry to everyone who got a duplicate copy of this series. I misspelled 'kernel.org' and it didn't go to the list. I also have this series available as a GitHub PR [1] [1] https://github.com/derrickstolee/git/pull/7
[PATCH 00/23] Multi-pack-index (MIDX)
This patch series includes a rewrite of the previous multi-pack-index RFC [1] using the feedback from the commit-graph feature. I based this series on 'next' as it requires the recent object-store patches. The multi-pack-index (MIDX) is explained fully in the design document 'Documentation/technical/midx.txt'. The short description is that the MIDX stores the information from all of the IDX files in a pack directory. The crucial design decision is that the IDX files still exist, so we can fall back to the IDX files if there is any issue with the MIDX (or core.midx is set to false, or a user downgrades Git, etc.) The MIDX feature has been part of our GVFS releases for a few months (since the RFC). It has behaved well, indexing over 31 million commits and trees across up to 250 packfiles. These MIDX files are nearly 1GB in size and take ~20 seconds to rewrite when adding new IDX information. This ~20s mark is something I'd like to improve, and I mention how to make the file incremental (similar to split-index) in the design document. I also want to make the commit-graph file incremental, so I'd like to do that at the same time after both the MIDX and commit-graph are stable. Lookup Speedups --- When looking for an object, Git uses an most-recently- used (MRU) cache of packfiles. This does pretty well to minimize the number of misses when searching through packfiles for an object, especially if there is one "big" packfile that contains most of the objets (so it will rarely miss and is usually one of the first two packfiles in the list). The MIDX does provide a way to remove these misses, improving lookup time. However, this lookup time greatly depends on the arrangement of the packfiles. For instance, if you take the Linux repository and repack using `git repack -adfF --max-pack-size=128m` then all commits will be in one packfile, all trees will be in a small set of packfiles and organized well so 'git rev-list --objects HEAD^{tree}' only inspects one or two packfiles. GVFS has the notion of a "prefetch packfile". These are packfiles that are precomputed by cache servers to contain the commits and trees introduced to the remote each day. GVFS downloads these packfiles and places them in an alternate. Since these are organized by "first time introduced" and the working directory is so large, the MRU misses are significant when performing a checkout and updating the .git/index file. To test the performance in this situation, I created a script that organizes the Linux repository in a similar fashion. I split the commit history into 50 parts by creating branches on every 10,000 commits of the first- parent history. Then, `git rev-list --objects A ^B` provides the list of objects reachable from A but not B, so I could send that to `git pack-objects` to create these "time-based" packfiles. With these 50 packfiles (deleting the old one from my fresh clone, and deleting all tags as they were no longer on-disk) I could then test 'git rev-list --objects HEAD^{tree}' and see: Before: 0.17s After: 0.13s % Diff: -23.5% By adding logic to count hits and misses to bsearch_pack, I was able to see that the command above calls that method 266,930 times with a hit rate of 33%. The MIDX has the same number of calls with a 100% hit rate. Abbreviation Speedups - To fully disambiguate an abbreviation, we must iterate through all packfiles to ensure no collision exists in any packfile. This requires O(P log N) time. With the MIDX, this is only O(log N) time. Our standard test [2] is 'git log --oneline --parents --raw' because it writes many abbreviations while also doing a lot of other work (walking commits and trees to compute the raw diff). For a copy of the Linux repository with 50 packfiles split by time, we observed the following: Before: 100.5 s After: 58.2 s % Diff: -59.7% Request for Review Attention I tried my best to take the feedback from the commit-graph feature and apply it to this feature. I also worked to follow the object-store refactoring as I could. I also have some local commits that create a 'verify' subcommand and integrate with 'fsck' similar to the commit-graph, but I'll leave those for a later series (and review is still underway for that part of the commit-graph). One place where I could use some guidance is related to the current state of 'the_hash_algo' patches. The file format allows a different "hash version" which then indicates the length of the hash. What's the best way to ensure this feature doesn't cause extra pain in the hash-agnostic series? This will inform how I go back and make the commit-graph feature better in this area, too. Thanks, -Stolee [1] https://public-inbox.org/git/20180107181459.222909-1-dsto...@microsoft.com/T/#u Previous MIDX RFC. [2] https://public-inbox.org/git/20171012120220.226427-1-dsto...@microsoft.com/ A patch series on