Re: [RFC PATCH 0/5] Add delta islands support

2018-07-26 Thread Johannes Schindelin
Hi Chris,


On Sun, 22 Jul 2018, Christian Couder wrote:

> This patch series is upstreaming work made by GitHub and available in:
> 
> https://github.com/peff/git/commits/jk/delta-islands
> 
> The patch in the above branch has been split into 5 patches with their
> own new commit message, but no other change has been made.
> 
> I kept Peff as the author and took the liberty to add his
> Signed-off-by to all the patches.
> 
> The patches look good to me except perhaps that "pack.islandCore" is
> not documented. Maybe something could be added in
> Documentation/technical/ too, or maybe improving commit messages could
> be enough.
> 
> Anyway I wanted to send this nearly "as is" first to get early
> feedback about what should be done.
> 
> This patch series is also available on GitHub in:
> 
> https://github.com/chriscool/git/commits/delta-islands

It might make sense to explain *somewhere* in the cover letter what the
patches are all about, i.e. what problem they are supposed to solve. I did
not see any indication here.

Ciao,
Dscho

> 
> Jeff King (5):
>   packfile: make get_delta_base() non static
>   Add delta-islands.{c,h}
>   pack-objects: add delta-islands support
>   repack: add delta-islands support
>   t: add t9930-delta-islands.sh
> 
>  Documentation/config.txt   |   8 +
>  Documentation/git-pack-objects.txt |  88 ++
>  Documentation/git-repack.txt   |   5 +
>  Makefile   |   1 +
>  builtin/pack-objects.c | 130 +---
>  builtin/repack.c   |   9 +
>  delta-islands.c| 490 +
>  delta-islands.h|  11 +
>  pack-objects.h |   4 +
>  packfile.c |  10 +-
>  packfile.h |   3 +
>  t/t9930-delta-islands.sh   | 143 +
>  12 files changed, 856 insertions(+), 46 deletions(-)
>  create mode 100644 delta-islands.c
>  create mode 100644 delta-islands.h
>  create mode 100755 t/t9930-delta-islands.sh
> 
> -- 
> 2.18.0.237.gffdb1dbdaa
> 
> 


Re: [RFC PATCH 0/5] Add delta islands support

2018-07-24 Thread Jeff King
On Tue, Jul 24, 2018 at 10:18:03AM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> >I think this is inherent in the scheme (we're losing some delta
> >opportunities). But I think it's also made worse because the delta
> >window gets clogged with candidates that are forbidden by the island
> >config.
> 
> Hmph, and the reason why objects that do not even belong to the same
> island to be usable as a base are in the object list in the first
> place is...?

Because we are doing a "git repack -ad" here. So we are considering
_all_ of the objects, but avoiding making deltas between some of them.

During an actual fetch, the islands are not used at all (but the delta
relationships left on disk are important for letting us reuse those
deltas as-is).

> >Repacking with a big --window helps (and doesn't take as long
> >as it otherwise might because we can reject some object pairs based
> >on islands before doing any computation on the content).
> 
> Ah, then yes, a large window with early culling based on the delta
> island criteria definitely sounds like the right solution to that
> problem.

I try to account for this somewhat by looking at islands in the sort we
do of the delta candidates.  But to be honest, I am not sure how much
sense that makes (but I did verify experimentally that it helps). Again,
this is one of the reasons I had not sent this previously: I am not at
all confident that it is doing the right thing in all places.

-Peff


Re: [RFC PATCH 0/5] Add delta islands support

2018-07-24 Thread Junio C Hamano
Jeff King  writes:

>I think this is inherent in the scheme (we're losing some delta
>opportunities). But I think it's also made worse because the delta
>window gets clogged with candidates that are forbidden by the island
>config.

Hmph, and the reason why objects that do not even belong to the same
island to be usable as a base are in the object list in the first
place is...?

>Repacking with a big --window helps (and doesn't take as long
>as it otherwise might because we can reject some object pairs based
>on islands before doing any computation on the content).

Ah, then yes, a large window with early culling based on the delta
island criteria definitely sounds like the right solution to that
problem.

>I have replacement code (which we have been running in production)
>that is more clever about the threshold, and also can handle gaps in
>the continuity (so we might realize we need to send objects 1-5000,
>then skip a few, then 5037-8000, and so on).

That vaguely sounds similar to what folks at $DAYJOB runs in their
Gerrit/jgit thing.



Re: [RFC PATCH 0/5] Add delta islands support

2018-07-24 Thread Jeff King
On Sun, Jul 22, 2018 at 07:48:31AM +0200, Christian Couder wrote:

> I kept Peff as the author and took the liberty to add his
> Signed-off-by to all the patches.

Yes, that's fine. This is a topic I've been meaning to upstream for a
long time, so I'm happy to see it progressing in that direction.

A few reasons I've been hesitant:

 - this produces worse packs for people cloning. The packs we serve at
   GitHub for heavily-forked repositories can be significantly larger
   than what you could get by repacking in isolation.

   I think this is inherent in the scheme (we're losing some delta
   opportunities). But I think it's also made worse because the delta
   window gets clogged with candidates that are forbidden by the island
   config. Repacking with a big --window helps (and doesn't take as long
   as it otherwise might because we can reject some object pairs based
   on islands before doing any computation on the content).

   I also haven't done much modeling on how this scheme progresses over
   time. Imagine users tend to create forks at random points in time and
   then let them get stale. You're going to accumulate a bunch of
   islands that increasingly don't overlap with each other. And as time
   goes on, your delta opportunities get worse and worse.

   Which isn't to say the scheme doesn't work to some degree. It has
   been running in production at GitHub for several years. But it's very
   cobbled together, and I had always hoped to give it some more careful
   thought before sending it upstream. ;) That may be "the perfect is
   the enemy of the good" though (or at least the enemy of the
   mediocre).

 - the whole regex island-naming configuration feels complicated and
   hacky to me. I haven't been able to come up with a better solution,
   though.

 - the whole islandCore thing is it's own weirdness (more below)

> The patches look good to me except perhaps that "pack.islandCore" is
> not documented. Maybe something could be added in
> Documentation/technical/ too, or maybe improving commit messages could
> be enough.

So basically what islandCore does is make one island "special", and we
write all of its objects out first in the packfile. The idea here was
that it would interact well with the pack-reuse code which tries to send
the early part of the pack out verbatim. And if you could pick the fork
which is most-often cloned, then that would be a win (so you'd pick
torvalds/linux.git here).

But:

 - the pack-reuse code we have upstream is crap; it doesn't kick in for
   this instance anyway, because it's too concerned with whether you're
   sending all of the pack. So if you have a 10GB pack and the "core"
   island is the entire first 1GB, you should be able to ship out that
   first GB easily. But the code says "well, that's only 10%!", which is
   not enough to trigger.

   I have replacement code (which we have been running in production)
   that is more clever about the threshold, and also can handle gaps in
   the continuity (so we might realize we need to send objects 1-5000,
   then skip a few, then 5037-8000, and so on). And after that, it's not
   at all clear to me if the islandCore thing is actually still helpful.

 - it's hard to configure, because it may actually change over time
   (whereas all the other island config is basically static and just impacts
   how we consider the refs that are there)

 - there may actually be _multiple_ forks that are "special". In
   torvalds/linux, for example, there's a fair bit of hierarchy, and
   there are people who forked Linus and have their own active
   sub-community. The actual island concept handles this kind of
   layering OK, but the islandCore thing is pretty static.

So I'm not sure if islandCore is even needed or not. But somebody would
have to do some experiments to figure it out (and obviously I should
share the replacement pack-reuse patches).

-Peff


[RFC PATCH 0/5] Add delta islands support

2018-07-21 Thread Christian Couder
This patch series is upstreaming work made by GitHub and available in:

https://github.com/peff/git/commits/jk/delta-islands

The patch in the above branch has been split into 5 patches with their
own new commit message, but no other change has been made.

I kept Peff as the author and took the liberty to add his
Signed-off-by to all the patches.

The patches look good to me except perhaps that "pack.islandCore" is
not documented. Maybe something could be added in
Documentation/technical/ too, or maybe improving commit messages could
be enough.

Anyway I wanted to send this nearly "as is" first to get early
feedback about what should be done.

This patch series is also available on GitHub in:

https://github.com/chriscool/git/commits/delta-islands

Jeff King (5):
  packfile: make get_delta_base() non static
  Add delta-islands.{c,h}
  pack-objects: add delta-islands support
  repack: add delta-islands support
  t: add t9930-delta-islands.sh

 Documentation/config.txt   |   8 +
 Documentation/git-pack-objects.txt |  88 ++
 Documentation/git-repack.txt   |   5 +
 Makefile   |   1 +
 builtin/pack-objects.c | 130 +---
 builtin/repack.c   |   9 +
 delta-islands.c| 490 +
 delta-islands.h|  11 +
 pack-objects.h |   4 +
 packfile.c |  10 +-
 packfile.h |   3 +
 t/t9930-delta-islands.sh   | 143 +
 12 files changed, 856 insertions(+), 46 deletions(-)
 create mode 100644 delta-islands.c
 create mode 100644 delta-islands.h
 create mode 100755 t/t9930-delta-islands.sh

-- 
2.18.0.237.gffdb1dbdaa