On Sat, Dec 3, 2016 at 7:47 PM, Lars Schneider <larsxschnei...@gmail.com> wrote:
>
>> On 30 Nov 2016, at 22:04, Christian Couder <christian.cou...@gmail.com> 
>> wrote:
>>
>> Goal
>> ~~~~
>>
>> Git can currently store its objects only as loose objects in
>> separate files or as packed objects in pack files.
>>
>> To be able to better handle some kinds of objects, for example big
>> blobs, it would be nice if Git could store its objects in other
>> object databases (ODB).
>
> This is a great goal. I really hope we can use that to solve the
> pain points in the current Git <--> GitLFS integration!

Yeah, I hope it will help too.

> Thanks for working on this!
>
> Minor nit: I feel the term "other" could be more expressive. Plus
> "database" might confuse people. What do you think about
> "External Object Storage" or something?

In the current Git code, "DB" is already used a lot. For example in
cache.h there is:

#define DB_ENVIRONMENT "GIT_OBJECT_DIRECTORY"

#define ALTERNATE_DB_ENVIRONMENT "GIT_ALTERNATE_OBJECT_DIRECTORIES"

#define INIT_DB_QUIET 0x0001
#define INIT_DB_EXIST_OK 0x0002

extern int init_db(const char *git_dir, const char *real_git_dir,
           const char *template_dir, unsigned int flags);

[...]

>>  - "<command> get <sha1>": the command should then read from the
>> external ODB the content of the object corresponding to <sha1> and
>> output it on stdout.
>>
>>  - "<command> put <sha1> <size> <type>": the command should then read
>> from stdin an object and store it in the external ODB.
>
> Based on my experience with Git clean/smudge filters I think this kind
> of single-shot protocol will be a performance bottleneck as soon as
> people store more than 1000 files in the external ODB.
> Maybe you can reuse my "filter process protocol" (edcc858) here?

Yeah, I would like to reuse your "filter process protocol" as much
as possible to improve this in the future.
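
To make the single-shot protocol concrete, here is a minimal sketch of
what such a helper could look like. The helper name (my-odb-helper) and
the storage layout (one file per sha1 under /var/odb) are made up for
this illustration:

#!/bin/sh
# Hypothetical helper for the "get"/"put" protocol described above.
# It stores one file per sha1 under /var/odb (layout is an assumption).
ODB_DIR=/var/odb

case "$1" in
get)
	# "<command> get <sha1>": write the object's content to stdout.
	cat "$ODB_DIR/$2"
	;;
put)
	# "<command> put <sha1> <size> <type>": read the object from
	# stdin and store it under its sha1.
	cat >"$ODB_DIR/$2"
	;;
*)
	echo >&2 "unknown command: $1"
	exit 1
	;;
esac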

>> * Transfer
>>
>> To transfer information about the blobs stored in external ODBs, some
>> special refs, called "odb refs", similar to replace refs, are used.
>>
>> For now there should be one odb ref per blob. Each ref name should be
>> refs/odbs/<odbname>/<sha1> where <sha1> is the sha1 of the blob stored
>> in the external odb named <odbname>.
>>
>> Each of these odb refs should point to a blob stored in the Git
>> repository that contains information about the blob stored in the
>> external odb. This information can be specific to the external odb.
>> Repos can then share this information using commands like:
>>
>> `git fetch origin "refs/odbs/<odbname>/*:refs/odbs/<odbname>/*"`
>
> The "odbref" would point to a blob and the blob could contain anything,
> right? E.g. it could contain an existing GitLFS pointer, right?
>
> version https://git-lfs.github.com/spec/v1
> oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
> size 12345

Yes, but I think the sha1 should also be added. So yes, it could
easily be made compatible with git LFS.
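
For illustration, here is roughly how an odb ref could be created and
shared from the command line. The odb name "myodb" and the pointer
content are placeholders, and $BIG_BLOB_SHA1 stands for the sha1 of the
blob stored in the external odb:

# Store the odb-specific information about the big blob as a
# regular blob in the Git repository...
INFO_SHA1=$(printf 'oid sha256:...\nsize 12345\n' |
	git hash-object -w --stdin)

# ...and create an odb ref pointing at it.
git update-ref "refs/odbs/myodb/$BIG_BLOB_SHA1" "$INFO_SHA1"

# Another repo can then get the odb refs with:
git fetch origin "refs/odbs/myodb/*:refs/odbs/myodb/*"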

>> Design discussion about performance
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>> Yeah, it is not efficient to fork/exec a command to just read or write
>> one object to or from the external ODB. Batch calls and/or using a
>> daemon and/or RPC should be used instead to be able to store regular
>> objects in an external ODB. But for now the external ODB would be all
>> about really big files, where the cost of a fork+exec should not
>> matter much. If we later want to extend usage of external ODBs, yeah
>> we will probably need to design other mechanisms.
>
> I think we should leverage the learnings from GitLFS as much as possible.
> My learnings are:
>
> (1) Fork/exec per object won't work. People have lots and lots of content
>     that is not suited for Git (e.g. integration test data, images, ...).

I agree that it will not work for many people, but look at how git LFS
evolved. It first started without a good solution for those people,
and then you provided a much better solution to them.
So I am a bit reluctant to work on a complex solution reusing your
"filter process protocol" work right away.

> (2) We need a good UI. I think it would be great if the average user
>     would not even need to know about the ODB. Moving files explicitly
>     with a "put" command seems impractical to me. GitLFS tracks files
>     via filename and that has a number of drawbacks, too. Do you see
>     a way to define a customizable metric such as "move all files to
>     ODB X whose gzip-compressed size is larger than Y"?

I think these criteria should be defined in the config and attributes
files. It could also be possible to implement a "want" command (in the
same way as the "get", "put" and "have" commands) to ask the e-odb
helper if it wants to store a specific blob.
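
As a sketch, such a "want" check could be just one more helper
invocation whose exit code says whether the helper wants the blob. This
command does not exist yet, and my-odb-helper is again hypothetical:

# Hypothetical: exit code 0 would mean "yes, I want to store
# this blob in my external ODB".
if my-odb-helper want "$sha1" "$size" "$type"
then
	git cat-file blob "$sha1" |
	my-odb-helper put "$sha1" "$size" "$type"
fi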

>> Future work
>> ~~~~~~~~~~~
>>
>> I think that the odb refs don't prevent a regular fetch or push from
>> wanting to send the objects that are managed by an external odb. So I
>> am interested in suggestions about this problem. I will take a look at
>> previous discussions and how other mechanisms (shallow clone, bundle
>> v3, ...) handle this.
>
> If the ODB configuration is stored in the Git repo, similar to
> .gitmodules, then every client that clones ODB references would be
> able to resolve them, right?

Yeah, but I am not sure that being able to resolve the odb refs will
prevent the big blobs from being sent.
With Git LFS, git doesn't know about the big blobs, only about the
substituted files, but that is not the case in what I am doing.

Thanks,
Christian.
