Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)

2017-10-04 Thread Jeff Hostetler



On 10/3/2017 7:42 PM, Jonathan Tan wrote:

> On Tue, Oct 3, 2017 at 7:39 AM, Jeff Hostetler  wrote:
>>
>> As I see it there are the following major parts to partial clone:
>> 1. How to let git-clone (and later git-fetch) specify the desired
>>    subset of objects that it wants?  (A ref-relative request.)
>> 2. How to let the server and git-pack-objects build that incomplete
>>    packfile?
>> 3. How to remember in the local config that a partial clone (or
>>    fetch) was used and that missing objects should be expected?
>> 4. How to dynamically fetch individual missing objects?
>>    (Not a ref-relative request.)
>> 5. How to augment the local ODB with partial clone information and
>>    let git-fsck (and friends) perform limited consistency checking?
>> 6. Methods for bulk fetching missing objects (whether in a pre-verb
>>    hook or in unpack-tree)
>> 7. Miscellaneous issues (e.g. fixing places that accidentally cause
>>    a missing object to be fetched that don't really need it).
>
> Thanks for the enumeration.
>
>> As was suggested above, I think we should merge our efforts:
>> using my filtering for 1 and 2 and Jonathan's code for 3, 4, and 5.
>> I would need to eliminate the "relax" options in favor of his
>> is_promised() functionality for index-pack and similar.  And omit
>> his blob-max-bytes changes from pack-objects, the protocol and
>> related commands.
>>
>> That should be a good first step.
>
> This sounds good to me. Jeff Hostetler's filtering (all blobs, blobs
> by size, blobs by sparse checkout specification) is more comprehensive
> than mine, so removing blob-max-bytes from my code is not a problem.
>
>> We both have thoughts on bulk fetching (mine in pre-verb hooks and
>> his in unpack-tree).  We don't need this immediately, but can wait
>> until the above is working to revisit.
>
> Agreed.



Thanks.

I'll make a first pass at merging our efforts then and
post something shortly.

Jeff



Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)

2017-10-03 Thread Jonathan Tan
On Tue, Oct 3, 2017 at 7:39 AM, Jeff Hostetler  wrote:
>
> As I see it there are the following major parts to partial clone:
> 1. How to let git-clone (and later git-fetch) specify the desired
>subset of objects that it wants?  (A ref-relative request.)
> 2. How to let the server and git-pack-objects build that incomplete
>packfile?
> 3. How to remember in the local config that a partial clone (or
>    fetch) was used and that missing objects should be expected?
> 4. How to dynamically fetch individual missing objects?
>    (Not a ref-relative request.)
> 5. How to augment the local ODB with partial clone information and
>let git-fsck (and friends) perform limited consistency checking?
> 6. Methods for bulk fetching missing objects (whether in a pre-verb
>    hook or in unpack-tree)
> 7. Miscellaneous issues (e.g. fixing places that accidentally cause
>a missing object to be fetched that don't really need it).

Thanks for the enumeration.

> As was suggested above, I think we should merge our efforts:
> using my filtering for 1 and 2 and Jonathan's code for 3, 4, and 5.
> I would need to eliminate the "relax" options in favor of his
> is_promised() functionality for index-pack and similar.  And omit
> his blob-max-bytes changes from pack-objects, the protocol and
> related commands.
>
> That should be a good first step.

This sounds good to me. Jeff Hostetler's filtering (all blobs, blobs
by size, blobs by sparse checkout specification) is more comprehensive
than mine, so removing blob-max-bytes from my code is not a problem.

> We both have thoughts on bulk fetching (mine in pre-verb hooks and
> his in unpack-tree).  We don't need this immediately, but can wait
> until the above is working to revisit.

Agreed.


Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)

2017-10-03 Thread Jeff Hostetler



On 10/3/2017 4:50 AM, Junio C Hamano wrote:

> Christian Couder  writes:
>
>> Could you give a bit more details about the use cases this is designed for?
>> It seems that when people review my work they want a lot of details
>> about the use cases, so I guess they would also be interested in
>> getting this kind of information for your work too.
>>
>> Could this support users who would be interested in lazily cloning
>> only one kind of files, for example *.jpeg?
>
> I do not know about others, but the reason why I was not interested
> in finding out "use cases" is because the value of this series is
> use-case agnostic.
>
> At least to me, the most interesting part of the series is that it
> allows you to receive a set of objects transferred from the other
> side that lack some of the objects that would otherwise be required to
> be here for connectivity purposes, and it introduces a mechanism to
> allow the object transfer layer, gc and fsck to work well together in
> the resulting repository that deliberately lacks some objects.  The
> transfer layer marks the objects obtained from a specific remote as
> such, and gc and fsck are taught not to try to follow a missing link
> or diagnose a missing link as an error, if a missing link is
> expected using the mark the transfer layer left.
>
> And it does so in such a way that it is use-case agnostic.  The
> mechanism does not care how the objects to be omitted were chosen,
> and how the omission criteria were negotiated between the sender and
> the receiver of the pack.
>
> I think the series comes with one filter that is size-based, but I
> view it as a technology demonstration.  It does not have to be the
> primary use case.  IOW, I do not think the series is meant as a
> declaration that size-based filtering is the most important thing
> and other omission criteria are less important.
>
> You should be able to build path-based omission (i.e. narrow clone)
> or blob-type based omission.  Depending on your needs, you may want
> different object omission criteria.  It is something you can build
> on top.  And the work done on transfer/gc/fsck in this series does
> not have to change to accommodate these different "use cases".



Agreed.

There are lots of reasons for wanting partial clones (and we've been
flinging lots of RFCs at each other that each seem to have a different
base assumption (small-blobs-only vs sparse-checkout vs …))
and not reaching consensus or closure.

The main thing is to allow the client to use partial clone to request
a "subset", let the server deliver that "subset", and let the client
tooling deal with an incomplete view of the repo.

As I see it there are the following major parts to partial clone:
1. How to let git-clone (and later git-fetch) specify the desired
   subset of objects that it wants?  (A ref-relative request.)
2. How to let the server and git-pack-objects build that incomplete
   packfile?
3. How to remember in the local config that a partial clone (or
   fetch) was used and that missing objects should be expected?
4. How to dynamically fetch individual missing objects?
   (Not a ref-relative request.)
5. How to augment the local ODB with partial clone information and
   let git-fsck (and friends) perform limited consistency checking?
6. Methods for bulk fetching missing objects (whether in a pre-verb
   hook or in unpack-tree)
7. Miscellaneous issues (e.g. fixing places that accidentally cause
   a missing object to be fetched that don't really need it).

My proposal [1] includes a generic filtering mechanism that handles 3
types of filtering and makes it easy to add other techniques as we
see fit.  It slips in at the list-objects / traverse_commit_list
level and hides all of the details from rev-list and pack-objects.
I have a follow on proposal [2] that extends the filtering parameter
handling to git-clone, git-fetch, git-fetch-pack, git-upload-pack
and the pack protocol.  That takes care of items 1 and 2 above.
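To make the list-objects-level filtering concrete: the `--filter` spellings below are only illustrative of the kind of interface being proposed, not necessarily the exact option names in [1] and [2]. The sketch builds a throwaway repo with one small and one large blob and lists reachable objects under three omission rules:

```shell
# Illustrative sketch of object filtering at the rev-list level.
# The --filter spellings are stand-ins for the syntax under review.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo" && cd "$tmp/repo"
git config user.email you@example.com
git config user.name "Example"
echo small > small.txt                 # a tiny blob
head -c 4096 /dev/zero > big.bin       # a 4 KiB blob
git add . && git commit -qm 'one small blob, one big blob'

# Full listing: one commit, one tree, two blobs (4 objects).
git rev-list --objects HEAD | wc -l

# Omit all blobs (the lazy-blob case): commit and tree only.
git rev-list --objects --filter=blob:none HEAD | wc -l

# Omit only blobs over a 1k threshold: big.bin drops out.
git rev-list --objects --filter=blob:limit=1k HEAD | wc -l
```

The point is that rev-list and pack-objects see the same filtered traversal, so the server-side packfile construction (item 2) falls out of the same mechanism as the client-side request (item 1).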

Jonathan's proposal [3] includes code to update the local config,
dynamically fetch individual objects, and handle the local ODB and
fsck consistency checking.  That takes care of items 3, 4, and 5.

As was suggested above, I think we should merge our efforts:
using my filtering for 1 and 2 and Jonathan's code for 3, 4, and 5.
I would need to eliminate the "relax" options in favor of his
is_promised() functionality for index-pack and similar.  And omit
his blob-max-bytes changes from pack-objects, the protocol and
related commands.

That should be a good first step.

We both have thoughts on bulk fetching (mine in pre-verb hooks and
his in unpack-tree).  We don't need this immediately, but can wait
until the above is working to revisit.

[1] https://github.com/jeffhostetler/git/pull/3
[2] https://github.com/jeffhostetler/git/pull/4
[3] https://github.com/jonathantanmy/git/tree/partialclone3

Thoughts?

Thanks,
Jeff


Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)

2017-10-03 Thread Junio C Hamano
Christian Couder  writes:

> Could you give a bit more details about the use cases this is designed for?
> It seems that when people review my work they want a lot of details
> about the use cases, so I guess they would also be interested in
> getting this kind of information for your work too.
>
> Could this support users who would be interested in lazily cloning
> only one kind of files, for example *.jpeg?

I do not know about others, but the reason why I was not interested
in finding out "use cases" is because the value of this series is
use-case agnostic.

At least to me, the most interesting part of the series is that it
allows you to receive a set of objects transferred from the other
side that lack some of the objects that would otherwise be required to
be here for connectivity purposes, and it introduces a mechanism to
allow the object transfer layer, gc and fsck to work well together in
the resulting repository that deliberately lacks some objects.  The
transfer layer marks the objects obtained from a specific remote as
such, and gc and fsck are taught not to try to follow a missing link
or diagnose a missing link as an error, if a missing link is
expected using the mark the transfer layer left.

And it does so in such a way that it is use-case agnostic.  The
mechanism does not care how the objects to be omitted were chosen,
and how the omission criteria were negotiated between the sender and
the receiver of the pack.

I think the series comes with one filter that is size-based, but I
view it as a technology demonstration.  It does not have to be the
primary use case.  IOW, I do not think the series is meant as a
declaration that size-based filtering is the most important thing
and other omission criteria are less important.

You should be able to build path-based omission (i.e. narrow clone)
or blob-type based omission.  Depending on your needs, you may want
different object omission criteria.  It is something you can build
on top.  And the work done on transfer/gc/fsck in this series does
not have to change to accommodate these different "use cases".




Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)

2017-10-03 Thread Christian Couder
On Fri, Sep 29, 2017 at 10:11 PM, Jonathan Tan  wrote:
> These patches are also available online:
> https://github.com/jonathantanmy/git/commits/partialclone3
>
> (I've announced it in another e-mail, but am now sending the patches to the
> mailing list too.)
>
> Here's an update of my work so far. Notable features:
>  - These 18 patches allow a user to clone with --blob-max-bytes=,
>creating a partial clone that is automatically configured to lazily
>fetch missing objects from the origin. The local repo also has fsck
>working offline, and GC working (albeit only on locally created
>objects).
>  - Cloning and fetching are currently only able to exclude blobs by a
>size threshold, but the local repository is already capable of
>fetching missing objects of any type. For example, if a repository
>with missing trees or commits is generated by any tool (for example,
>a future version of Git), current Git with my patches will still be
>able to operate on them, automatically fetching those missing trees
>and commits when needed.
>  - Missing blobs are fetched all at once during checkout.

Could you give a bit more details about the use cases this is designed for?
It seems that when people review my work they want a lot of details
about the use cases, so I guess they would also be interested in
getting this kind of information for your work too.

Could this support users who would be interested in lazily cloning
only one kind of files, for example *.jpeg?


Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)

2017-10-01 Thread Junio C Hamano
Jonathan Tan  writes:

> Jeff Hostetler has sent out some object-filtering patches [1] that are a
> superset of the object-filtering functionality that I have (in the
> pack-objects patches). I have gone for the minimal approach here, but if
> his patches are merged, I'll update my patch set to use those.
>
> [1] https://public-inbox.org/git/20170922203017.53986-6-...@jeffhostetler.com/

Sounds good.  Or perhaps rebasing the other way around, if we feel
that the "fsck with known-missing object" part of your series is
in a better state of done-ness than Jeff's series (which is my
impression, but I have an obvious bias in that I happened to have
reviewed your series with a finer-toothed comb before I saw Jeff's
series).

Thanks for working well together ;-).


Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)

2017-09-29 Thread Johannes Schindelin
Hi Jonathan,

On Fri, 29 Sep 2017, Jonathan Tan wrote:

> Jeff Hostetler has sent out some object-filtering patches [1] that are a
> superset of the object-filtering functionality that I have (in the
> pack-objects patches). I have gone for the minimal approach here, but if
> his patches are merged, I'll update my patch set to use those.

I wish there was a way for you to work *with* Jeff on this. It seems that
your aims are similar enough for that (you both need changes in the
protocol) yet different enough to allow for talking past each other (big
blobs vs narrow clone).

And I get the impression that in this instance, it slows everything down
to build competing, large patch series rather than building on top of each
other's work.

Additionally, I am not helping by pestering Jeff all the time about
different issues, so it is partially my fault.

But maybe there is a chance to *really* go for a minimal approach, as in
"incremental enough that you can share the first <n> patches"? And even
better: "come up with those first <n> patches together"?

Ciao,
Dscho




[PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)

2017-09-29 Thread Jonathan Tan
These patches are also available online:
https://github.com/jonathantanmy/git/commits/partialclone3

(I've announced it in another e-mail, but am now sending the patches to the
mailing list too.)

Here's an update of my work so far. Notable features:
 - These 18 patches allow a user to clone with --blob-max-bytes=,
   creating a partial clone that is automatically configured to lazily
   fetch missing objects from the origin. The local repo also has fsck
   working offline, and GC working (albeit only on locally created
   objects).
 - Cloning and fetching are currently only able to exclude blobs by a
   size threshold, but the local repository is already capable of
   fetching missing objects of any type. For example, if a repository
   with missing trees or commits is generated by any tool (for example,
   a future version of Git), current Git with my patches will still be
   able to operate on them, automatically fetching those missing trees
   and commits when needed.
 - Missing blobs are fetched all at once during checkout.

Jeff Hostetler has sent out some object-filtering patches [1] that are a
superset of the object-filtering functionality that I have (in the
pack-objects patches). I have gone for the minimal approach here, but if
his patches are merged, I'll update my patch set to use those.

[1] https://public-inbox.org/git/20170922203017.53986-6-...@jeffhostetler.com/

Demo
====

Obtain a repository.

$ make prefix=$HOME/local install
$ cd $HOME/tmp
$ git clone https://github.com/git/git

Make it advertise the new feature and allow requests for arbitrary blobs.

$ git -C git config uploadpack.advertiseblobmaxbytes 1
$ git -C git config uploadpack.allowanysha1inwant 1

Perform the partial clone and check that it is indeed smaller. Specify
"file://" in order to test the partial clone mechanism. (If not, Git will
perform a local clone, which unselectively copies every object.)

$ git clone --blob-max-bytes=0 "file://$(pwd)/git" git2
$ git clone "file://$(pwd)/git" git3
$ du -sh git2 git3
85M   git2
130M  git3

Observe that the new repo is automatically configured to fetch missing objects
from the original repo. Subsequent fetches will also be partial.

$ cat git2/.git/config
[core]
        repositoryformatversion = 1
        filemode = true
        bare = false
        logallrefupdates = true
[remote "origin"]
        url = [snip]
        fetch = +refs/heads/*:refs/remotes/origin/*
        blobmaxbytes = 0
[extensions]
        partialclone = origin
[branch "master"]
        remote = origin
        merge = refs/heads/master
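For illustration, the same settings can be written to and read back from a standalone file with `git config -f` (the file path is arbitrary; the values are the ones from the demo above):

```shell
# Sketch: writing and reading back the partial-clone configuration
# shown above, against a standalone config file rather than a repo.
set -e
cfg=$(mktemp)
git config -f "$cfg" core.repositoryformatversion 1
git config -f "$cfg" extensions.partialclone origin
git config -f "$cfg" remote.origin.blobmaxbytes 0

# Any extensions.* key requires repositoryformatversion = 1, so that
# older Gits refuse to operate on the repo instead of misbehaving.
git config -f "$cfg" extensions.partialclone    # prints "origin"
```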

Design
======

Local repository layout
-----------------------

A repository declares its dependence on a *promisor remote* (a remote that
declares that it can serve certain objects when requested) by a repository
extension "partialclone". `extensions.partialclone` must be set to the name of
the remote ("origin" in the demo above).

A packfile can be annotated as originating from the promisor remote by the
existence of a "(packfile name).promisor" file with arbitrary contents (similar
to the ".keep" file). Whenever a promisor remote sends an object, it declares
that it can serve every object directly or indirectly referenced by the sent
object.

A promisor packfile is a packfile annotated with the ".promisor" file. A
promisor object is an object that the promisor remote is known to be able to
serve, because it is an object in a promisor packfile or directly referred to by
one.

(In the future, we might need to add ".promisor" support to loose objects.)
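As a sketch of the marker convention only (the pack names below are invented, and this shows just the naming rule, not real pack handling):

```shell
# A pack is a promisor pack iff a sibling "(name).promisor" file
# exists; its contents are ignored, exactly like ".keep".
set -e
packdir=$(mktemp -d)
touch "$packdir/pack-1111.pack"        # ordinary pack
touch "$packdir/pack-2222.pack"        # promisor pack, because of...
touch "$packdir/pack-2222.promisor"    # ...this empty marker file

for pack in "$packdir"/*.pack; do
    if [ -e "${pack%.pack}.promisor" ]; then
        echo "promisor:     $(basename "$pack")"
    else
        echo "non-promisor: $(basename "$pack")"
    fi
done
```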

Connectivity check and gc
-------------------------

The object walk done by the connectivity check (as used by fsck and fetch) stops
at all promisor objects.

The object walk done by gc also stops at all promisor objects. Only non-promisor
packfiles are deleted (if pack deletion is requested); promisor packfiles are
left alone. This maintains the distinction between promisor packfiles and
non-promisor packfiles. (In the future, we might need to do something more
sophisticated with promisor packfiles.)
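A toy illustration of such a walk, with invented object names and edges: the traversal stops at a promisor object instead of following it, so a missing child of that object is never reported.

```shell
# Toy connectivity walk that stops at promisor objects, as the
# fsck/gc walks described above do. Names and edges are made up.
edges="commit tree
tree blob1
tree blob2"
promisors=" blob1 "    # the set of known promisor objects

walk() {
    case "$promisors" in
        *" $1 "*) echo "stop at promisor $1"; return 0;;
    esac
    echo "visit $1"
    echo "$edges" | while read -r parent child; do
        if [ "$parent" = "$1" ]; then walk "$child"; fi
    done
}

walk commit
# visits commit, tree, blob2; stops at (never descends into) blob1
```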

Fetching of missing objects
---------------------------

When `sha1_object_info_extended()` (or similar) is invoked, it will
automatically attempt to fetch a missing object from the promisor remote if that
object is not in the local repository. For efficiency, no check is made as to
whether that object is known to be a promisor object or not.

This automatic fetching can be toggled on and off by the `fetch_if_missing`
global variable, and it is on by default.
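A toy shell model of that fallback behavior (directories stand in for the local and remote object stores, and the object name is invented; the real implementation lives in the C object-store code):

```shell
# Model of the fetch_if_missing toggle: read_object() tries the local
# store first and falls back to "fetching" from the promisor remote
# (here just another directory) only while the flag is on.
set -e
local_odb=$(mktemp -d)
remote_odb=$(mktemp -d)
echo "blob payload" > "$remote_odb/abc123"   # object only the remote has

fetch_if_missing=1

read_object() {
    if [ -e "$local_odb/$1" ]; then
        cat "$local_odb/$1"
    elif [ "$fetch_if_missing" -eq 1 ]; then
        # In real Git this is a fetch-pack request to the promisor remote.
        cp "$remote_odb/$1" "$local_odb/$1"
        cat "$local_odb/$1"
    else
        echo "fatal: missing object $1" >&2
        return 1
    fi
}

read_object abc123      # lazily "fetched", then printed
read_object abc123      # now served from the local store
fetch_if_missing=0
read_object abc123      # still fine: it is present locally by now
```

Skipping the "is this a promisor object?" check before fetching, as the text above notes, is a deliberate efficiency trade-off: a lookup miss simply becomes a fetch attempt.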

The actual fetch is done through the fetch-pack/upload-pack protocol. Right now,
this uses the fact that upload-pack allows blob and tree "want"s, and this
incurs the overhead of the unnecessary ref advertisement. I hope that protocol
v2 will allow us to declare that blob and tree "want"s are allowed, and allow
the client to declare that it does not want the ref advertisement. All packfiles
downloaded in this way