Re: [RFC PATCH 0/5] Add delta islands support
Hi Chris, On Sun, 22 Jul 2018, Christian Couder wrote: > This patch series is upstreaming work made by GitHub and available in: > > https://github.com/peff/git/commits/jk/delta-islands > > The patch in the above branch has been split into 5 patches with their > own new commit message, but no other change has been made. > > I kept Peff as the author and took the liberty to add his > Signed-off-by to all the patches. > > The patches look good to me except perhaps that "pack.islandCore" is > not documented. Maybe something could be added in > Documentation/technical/ too, or maybe improving commit messages could > be enough. > > Anyway I wanted to send this nearly "as is" first to get early > feedback about what should be done. > > This patch series is also available on GitHub in: > > https://github.com/chriscool/git/commits/delta-islands It might make sense to explain *somewhere* in the cover letter what the patches are all about, i.e. what problem they are supposed to solve. I did not see any indication here. Ciao, Dscho > > Jeff King (5): > packfile: make get_delta_base() non static > Add delta-islands.{c,h} > pack-objects: add delta-islands support > repack: add delta-islands support > t: add t9930-delta-islands.sh > > Documentation/config.txt | 8 + > Documentation/git-pack-objects.txt | 88 ++ > Documentation/git-repack.txt | 5 + > Makefile | 1 + > builtin/pack-objects.c | 130 +--- > builtin/repack.c | 9 + > delta-islands.c| 490 + > delta-islands.h| 11 + > pack-objects.h | 4 + > packfile.c | 10 +- > packfile.h | 3 + > t/t9930-delta-islands.sh | 143 + > 12 files changed, 856 insertions(+), 46 deletions(-) > create mode 100644 delta-islands.c > create mode 100644 delta-islands.h > create mode 100755 t/t9930-delta-islands.sh > > -- > 2.18.0.237.gffdb1dbdaa > >
Re: [RFC PATCH 0/5] Add delta islands support
On Tue, Jul 24, 2018 at 10:18:03AM -0700, Junio C Hamano wrote: > Jeff King writes: > > >I think this is inherent in the scheme (we're losing some delta > >opportunities). But I think it's also made worse because the delta > >window gets clogged with candidates that are forbidden by the island > >config. > > Hmph, and the reason why objects that do not even belong to the same > island to be usable as a base are in the object list in the first > place is...? Because we are doing a "git repack -ad" here. So we are considering _all_ of the objects, but avoiding making deltas between some of them. During an actual fetch, the islands are not used at all (but the delta relationships left on disk are important for letting us reuse those deltas as-is). > >Repacking with a big --window helps (and doesn't take as long > >as it otherwise might because we can reject some object pairs based > >on islands before doing any computation on the content). > > Ah, then yes, a large window with early culling based on the delta > island criteria definitely sounds like the right solution to that > problem. I try to account for this somewhat by looking at islands in the sort we do of the delta candidates. But to be honest, I am not sure how much sense that makes (but I did verify experimentally that it helps). Again, this is one of the reasons I had not sent this previously: I am not at all confident that it is doing the right thing in all places. -Peff
Re: [RFC PATCH 0/5] Add delta islands support
Jeff King writes: >I think this is inherent in the scheme (we're losing some delta >opportunities). But I think it's also made worse because the delta >window gets clogged with candidates that are forbidden by the island >config. Hmph, and the reason why objects that do not even belong to the same island to be usable as a base are in the object list in the first place is...? >Repacking with a big --window helps (and doesn't take as long >as it otherwise might because we can reject some object pairs based >on islands before doing any computation on the content). Ah, then yes, a large window with early culling based on the delta island criteria definitely sounds like the right solution to that problem. >I have replacement code (which we have been running in production) >that is more clever about the threshold, and also can handle gaps in >the continuity (so we might realize we need to send objects 1-5000, >then skip a few, then 5037-8000, and so on). That vaguely sounds similar to what folks at $DAYJOB runs in their Gerrit/jgit thing.
Re: [RFC PATCH 0/5] Add delta islands support
On Sun, Jul 22, 2018 at 07:48:31AM +0200, Christian Couder wrote: > I kept Peff as the author and took the liberty to add his > Signed-off-by to all the patches. Yes, that's fine. This is a topic I've been meaning to upstream for a long time, so I'm happy to see it progressing in that direction. A few reasons I've been hesitant: - this produces worse packs for people cloning. The packs we serve at GitHub for heavily-forked repositories can be significantly larger than what you could get by repacking in isolation. I think this is inherent in the scheme (we're losing some delta opportunities). But I think it's also made worse because the delta window gets clogged with candidates that are forbidden by the island config. Repacking with a big --window helps (and doesn't take as long as it otherwise might because we can reject some object pairs based on islands before doing any computation on the content). I also haven't done much modeling on how this scheme progresses over time. Imagine users tend to create forks at random points in time and then let them get stale. You're going to accumulate a bunch of islands that increasingly don't overlap with each other. And as time goes on, your delta opportunities get worse and worse. Which isn't to say the scheme doesn't work to some degree. It has been running in production at GitHub for several years. But it's very cobbled together, and I had always hoped to give it some more careful thought before sending it upstream. ;) That may be "the perfect is the enemy of the good" though (or at least the enemy of the mediocre). - the whole regex island-naming configuration feels complicated and hacky to me. I haven't been able to come up with a better solution, though. - the whole islandCore thing is it's own weirdness (more below) > The patches look good to me except perhaps that "pack.islandCore" is > not documented. Maybe something could be added in > Documentation/technical/ too, or maybe improving commit messages could > be enough. So basically what islandCore does is make one island "special", and we write all of its objects out first in the packfile. The idea here was that it would interact well with the pack-reuse code which tries to send the early part of the pack out verbatim. And if you could pick the fork which is most-often cloned, then that would be a win (so you'd pick torvalds/linux.git here). But: - the pack-reuse code we have upstream is crap; it doesn't kick in for this instance anyway, because it's too concerned with whether you're sending all of the pack. So if you have a 10GB pack and the "core" island is the entire first 1GB, you should be able to ship out that first GB easily. But the code says "well, that's only 10%!", which is not enough to trigger. I have replacement code (which we have been running in production) that is more clever about the threshold, and also can handle gaps in the continuity (so we might realize we need to send objects 1-5000, then skip a few, then 5037-8000, and so on). And after that, it's not at all clear to me if the islandCore thing is actually still helpful. - it's hard to configure, because it may actually change over time (whereas all the other island config is basically static and just impacts how we consider the refs that are there) - there may actually be _multiple_ forks that are "special". In torvalds/linux, for example, there's a fair bit of hierarchy, and there are people who forked Linus and have their own active sub-community. The actual island concept handles this kind of layering OK, but the islandCore thing is pretty static. So I'm not sure if islandCore is even needed or not. But somebody would have to do some experiments to figure it out (and obviously I should share the replacement pack-reuse patches). -Peff
[RFC PATCH 0/5] Add delta islands support
This patch series is upstreaming work made by GitHub and available in: https://github.com/peff/git/commits/jk/delta-islands The patch in the above branch has been split into 5 patches with their own new commit message, but no other change has been made. I kept Peff as the author and took the liberty to add his Signed-off-by to all the patches. The patches look good to me except perhaps that "pack.islandCore" is not documented. Maybe something could be added in Documentation/technical/ too, or maybe improving commit messages could be enough. Anyway I wanted to send this nearly "as is" first to get early feedback about what should be done. This patch series is also available on GitHub in: https://github.com/chriscool/git/commits/delta-islands Jeff King (5): packfile: make get_delta_base() non static Add delta-islands.{c,h} pack-objects: add delta-islands support repack: add delta-islands support t: add t9930-delta-islands.sh Documentation/config.txt | 8 + Documentation/git-pack-objects.txt | 88 ++ Documentation/git-repack.txt | 5 + Makefile | 1 + builtin/pack-objects.c | 130 +--- builtin/repack.c | 9 + delta-islands.c| 490 + delta-islands.h| 11 + pack-objects.h | 4 + packfile.c | 10 +- packfile.h | 3 + t/t9930-delta-islands.sh | 143 + 12 files changed, 856 insertions(+), 46 deletions(-) create mode 100644 delta-islands.c create mode 100644 delta-islands.h create mode 100755 t/t9930-delta-islands.sh -- 2.18.0.237.gffdb1dbdaa