On Fri, Sep 29, 2017 at 10:36 PM, Jonathan Tan <jonathanta...@google.com> wrote:
> On Wed, 27 Sep 2017 18:46:30 +0200
> Christian Couder <christian.cou...@gmail.com> wrote:

>> I don't think single-shot processes would be a huge burden, because
>> the code is simpler, and because, for filters for example, we already
>> have both single-shot and long-running processes and no one complains
>> about that. It's code that is useful as it makes it much easier for
>> people to do some things (see the clone bundle example).
>>
>> In fact in Git development we usually start by first implementing
>> simpler single-shot solutions, before thinking, when the need arises,
>> about making them faster. So perhaps an equally valid opinion could be
>> to first only submit the patches for the single-shot protocol and later
>> submit the rest of the series when we start getting feedback about how
>> external odbs are used.
>
> My concern is that, as far as I understand about the Microsoft use case,
> we already know that we need the faster solution, so the need has
> already arisen.

Yeah, some people need the faster solution, but my opinion is that
many other people would prefer the single-shot protocol.
If all you want is a simple resumable clone using bundles, for
example, then the long-running process solution is very much overkill.

For example, with filters there are people using them to do keyword
expansion (maybe to emulate the way Subversion and CVS substitute
keywords like $Id$, $Author$ and so on). It would be really bad to
deprecate the single-shot filters and tell those people they now have
to use long-running processes just because we don't want to maintain
the small amount of code that makes single-shot filters work.
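
Just to illustrate how simple the single-shot case is for users, here
is a minimal sketch of such a filter (the "kwdemo" driver name and the
sed expression are made up for this example; a real setup would also
re-expand the keyword in its smudge command):

    # .gitattributes
    *.c filter=kwdemo

    # .git/config: the clean command collapses an expanded $Id$ keyword
    # back to its bare form on checkin; smudge is left as a no-op here.
    [filter "kwdemo"]
            clean = sed -e 's/[$]Id:[^$]*[$]/$Id$/'
            smudge = cat

Each time Git needs to clean or smudge a blob it just runs the command
once and lets it exit, which is exactly what makes this kind of setup
easy to write and debug.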

The Microsoft GVFS use case is just one use case, and one that is
very far from what most people need. And my opinion is that many more
people could benefit from the single-shot protocol. For example, many
people and admins could benefit from resumable clones using bundles,
and, if I remove the single-shot protocol, that use case will be
unnecessarily more difficult to implement, in the same way that
keyword expansion would be unnecessarily more difficult to implement
if we removed the single-shot filters.

See the first article in
https://git.github.io/rev_news/2016/03/16/edition-13/ about resumable
clone if you are not convinced that resumable clones are an old and
important problem.
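
To make the bundle idea concrete, here is roughly what an external
store (or just an admin's documentation) could let users do today; the
example.com URLs are of course made up, and the point is only that the
download step is resumable while the Git protocol is not:

    # server side: create a bundle and publish it over HTTP
    git bundle create repo.bundle --all
    # (copy repo.bundle next to the repository on the web server)

    # client side: the download can be resumed after an interruption
    curl -C - -O https://example.com/repo.bundle
    git clone repo.bundle myrepo
    cd myrepo
    git remote set-url origin https://example.com/repo.git
    git fetch origin

An external store and helper could automate that kind of workflow, but
even this manual version shows why I find the single-shot case
valuable.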

>> And yeah I could change the order of the patch series to implement the
>> long-running processes first and the single-shot process last, so that
>> it could be possible to first get feedback about the long-running
>> processes before we decide whether or not to merge the single-shot
>> stuff, but I don't think it would be the most logical order.
>
> My thinking was that we would just implement the long-running process
> and not implement the single-shot process at all (besides maybe a script
> in contrib/). If we are going to do both anyway, I agree that we should
> do the single-shot process first.

Nice to hear that!

>> > And I think that my design can be extended to support a use case in
>> > which, for example, blobs corresponding to a certain type of filename
>> > (defined by a glob like in gitattributes) can be excluded during
>> > fetch/clone, much like --blob-max-bytes, and they can be fetched either
>> > through the built-in mechanism or through a custom hook.
>>
>> Sure, we could probably rebuild something equivalent to what I did on
>> top of your design.
>> My opinion though is that if we want to eventually get to the same
>> goal, it is better to first merge something that gets us very close to
>> the end goal and then add some improvements on top of it.
>
> I agree

So are you OK with rebasing your patch series on top of mine?

My opinion is that my patch series is trying to get to the end goal,
and succeeding to a very large extent, with as few deep technical
changes as possible, and that it is the right way to approach this
problem for the following reasons:

1) The root problem is that the current object stores (packfiles and
loose object files) are not good ways to store some objects,
especially some blobs.

2) This root problem cannot be dealt with by Git itself without any
help from external programs, because Git cannot realistically
implement many different object stores (like HTTP servers, artifact
stores, etc.). So Git must be improved so that it becomes capable of
communicating with external object stores.

3) As the Git protocol uses packfiles to send objects and is not very
flexible, it might be better if external stores could also be used to
transfer objects that are no longer stored in the current object
stores. (As packfiles are not good for storing some objects, they are
probably also not a good format for sending them. Also, as the Git
protocol is not resumable, we might easily be able to implement
resumable clones if we let external stores handle some of the
transfer.)

4) Making it easy and flexible to exchange objects (and maybe meta
information) with the external stores is very important.

5) Protocol changes are more difficult than many other code changes,
so we should care a lot about the protocol between Git and external
stores (a rough sketch of the kind of single-shot helper I have in
mind follows below).
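
Just so we are talking about the same thing, here is that sketch; the
command names and the example.com URLs are made up for illustration
and are not exactly what my patch series implements:

    #!/bin/sh
    # my-odb-helper: invoked once per request, does its job and exits
    case "$1" in
    have)
            # list the objects the external store has
            curl -fs https://example.com/odb/list
            ;;
    get)
            # write the content of the object named in $2 to stdout
            curl -fs "https://example.com/odb/$2"
            ;;
    put)
            # store the content given on stdin under the name in $2
            curl -fs -T - "https://example.com/odb/$2"
            ;;
    *)
            exit 1
            ;;
    esac

The important part to me is not this particular script, but keeping
the interface simple enough that a small script like this is all that
is needed to plug in an external store.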

> - I mentioned that because I personally prefer to review smaller
> patch sets at a time,

I am OK with sending smaller patch sets, and I will do that for this
topic from now on.

> and my patch set already includes a lot of the
> same infrastructure needed by yours - for example, the places in the
> code to dynamically fetch objects, exclusion of objects when fetching or
> cloning, configuring the cloned repo when cloning, fsck, and gc.

I agree that your patch set already includes some infrastructure that
could be used by my work, and it perhaps implements some of this
infrastructure better than my work does (I haven't taken a deep
look). But I really think that the right approach is to focus first
on designing a flexible protocol between Git and external stores.
Then the infrastructure work should be about improving or enabling
that flexible protocol and the communication between Git and external
stores.

Doing the infrastructure work first, and improving things on top of
that new infrastructure without first settling on a design for the
protocol between Git and external stores, is not the best approach,
as I think we might over-engineer some of the infrastructure, or base
some user interfaces on the infrastructure rather than on the end
goal.

For example, if we improve the current protocol, which is not
necessarily a bad thing in itself, we might forget that for resumable
clones it is much better to just let external stores and helpers
handle the transfer.

I am not saying that doing infrastructure work is bad or will not in
the end let us reach our goals, but I see it as something that could
distract, or mislead, us from focusing first on the protocol between
Git and external stores.

>> >  - I get compile errors when I "git am" these onto master. I think
>> >    '#include "config.h"' is needed in some places.
>>
>> It's strange because I get no compile errors even after a "make clean"
>> from my branch.
>> Could you show the actual errors?
>
> I don't have the error messages with me now, but it was something about
> a function being implicitly declared. You will probably get these errors
> if you sync past commit e67a57f ("config: create config.h", 2017-06-15).

I am past this commit and I get no errors.
I rebased on top of ea220ee40c ("The eleventh batch for 2.15").

>> > Any reason why you prefer to update the loose object functions than to
>> > update the generic one (sha1_object_info_extended)? My concern with just
>> > updating the loose object functions was that a caller might have
>> > obtained the path by iterating through the loose object dirs, and in
>> > that case we shouldn't query the external ODB for anything.
>>
>> You are thinking about fsck or gc?
>> Otherwise I don't think it would be clean to iterate through loose object 
>> dirs.
>
> Yes, fsck and gc (well, prune, I think) do that. I agree that Git
> typically doesn't do that (except for exceptional cases like fsck and
> gc), but I was thinking about supporting existing code that does that
> iteration, not introducing new code that does that.

I haven't taken a look at how fsck and prune work, and this part is
still code that Peff wrote (though a long time ago), so I tend to
trust it. But I will take a look, and if it is indeed better for
them, I am OK with updating sha1_object_info_extended() instead of
the loose object functions.

Thanks.
