Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
On 10/3/2017 7:42 PM, Jonathan Tan wrote:
> On Tue, Oct 3, 2017 at 7:39 AM, Jeff Hostetler wrote:
>> As I see it there are the following major parts to partial clone:
>> 1. How to let git-clone (and later git-fetch) specify the desired
>>    subset of objects that it wants? (A ref-relative request.)
>> 2. How to let the server and git-pack-objects build that incomplete
>>    packfile?
>> 3. How to remember in the local config that a partial clone (or
>>    fetch) was used and that missing objects should be expected?
>> 4. How to dynamically fetch individual missing objects?
>>    (Not a ref-relative request.)
>> 5. How to augment the local ODB with partial clone information and
>>    let git-fsck (and friends) perform limited consistency checking?
>> 6. Methods to bulk fetch missing objects (whether in a pre-verb
>>    hook or in unpack-tree).
>> 7. Miscellaneous issues (e.g. fixing places that accidentally cause
>>    a missing object to be fetched that don't really need it).
>
> Thanks for the enumeration.
>
>> As was suggested above, I think we should merge our efforts:
>> using my filtering for 1 and 2 and Jonathan's code for 3, 4, and 5.
>> I would need to eliminate the "relax" options in favor of his
>> is_promised() functionality for index-pack and similar. And omit
>> his blob-max-bytes changes from pack-objects, the protocol and
>> related commands.
>>
>> That should be a good first step.
>
> This sounds good to me. Jeff Hostetler's filtering (all blobs, blobs
> by size, blobs by sparse checkout specification) is more comprehensive
> than mine, so removing blob-max-bytes from my code is not a problem.
>
>> We both have thoughts on bulk fetching (mine in pre-verb hooks and
>> his in unpack-tree). We don't need this immediately, but can wait
>> until the above is working to revisit.
>
> Agreed.

Thanks. I'll make a first pass at merging our efforts then and post
something shortly.

Jeff
Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
On Tue, Oct 3, 2017 at 7:39 AM, Jeff Hostetler wrote:
>
> As I see it there are the following major parts to partial clone:
> 1. How to let git-clone (and later git-fetch) specify the desired
>    subset of objects that it wants? (A ref-relative request.)
> 2. How to let the server and git-pack-objects build that incomplete
>    packfile?
> 3. How to remember in the local config that a partial clone (or
>    fetch) was used and that missing objects should be expected?
> 4. How to dynamically fetch individual missing objects?
>    (Not a ref-relative request.)
> 5. How to augment the local ODB with partial clone information and
>    let git-fsck (and friends) perform limited consistency checking?
> 6. Methods to bulk fetch missing objects (whether in a pre-verb
>    hook or in unpack-tree).
> 7. Miscellaneous issues (e.g. fixing places that accidentally cause
>    a missing object to be fetched that don't really need it).

Thanks for the enumeration.

> As was suggested above, I think we should merge our efforts:
> using my filtering for 1 and 2 and Jonathan's code for 3, 4, and 5.
> I would need to eliminate the "relax" options in favor of his
> is_promised() functionality for index-pack and similar. And omit
> his blob-max-bytes changes from pack-objects, the protocol and
> related commands.
>
> That should be a good first step.

This sounds good to me. Jeff Hostetler's filtering (all blobs, blobs by
size, blobs by sparse checkout specification) is more comprehensive than
mine, so removing blob-max-bytes from my code is not a problem.

> We both have thoughts on bulk fetching (mine in pre-verb hooks and
> his in unpack-tree). We don't need this immediately, but can wait
> until the above is working to revisit.

Agreed.
Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
On 10/3/2017 4:50 AM, Junio C Hamano wrote:
> Christian Couder writes:
>> Could you give a bit more details about the use cases this is
>> designed for? It seems that when people review my work they want a
>> lot of details about the use cases, so I guess they would also be
>> interested in getting this kind of information for your work too.
>>
>> Could this support users who would be interested in lazily cloning
>> only one kind of files, for example *.jpeg?
>
> I do not know about others, but the reason why I was not interested
> in finding out "use cases" is because the value of this series is
> use-case agnostic.
>
> At least to me, the most interesting part of the series is that it
> allows you to receive a set of objects transferred from the other
> side that lacks some of the objects that would otherwise be required
> to be here for connectivity purposes, and it introduces a mechanism
> to allow the object transfer layer, gc and fsck to work well together
> in the resulting repository that deliberately lacks some objects.
> The transfer layer marks the objects obtained from a specific remote
> as such, and gc and fsck are taught not to try to follow a missing
> link, or diagnose a missing link as an error, if the missing link is
> expected, using the mark the transfer layer left.
>
> And it does so in such a way that it is use-case agnostic. The
> mechanism does not care how the objects to be omitted were chosen,
> or how the omission criteria were negotiated between the sender and
> the receiver of the pack.
>
> I think the series comes with one filter that is size-based, but I
> view it as a technology demonstration. It does not have to be the
> primary use case. IOW, I do not think the series is meant as a
> declaration that size-based filtering is the most important thing
> and other omission criteria are less important. You should be able
> to build path-based omission (i.e. narrow clone) or blob-type-based
> omission. Depending on your needs, you may want different object
> omission criteria. It is something you can build on top.
>
> And the work done on transfer/gc/fsck in this series does not have
> to change to accommodate these different "use cases".

Agreed. There are lots of reasons for wanting partial clones (and we've
been flinging lots of RFCs at each other that each seem to have a
different base assumption (small-blobs-only vs sparse-checkout vs ...))
without reaching consensus or closure. The main thing is to allow the
client to use partial clone to request a "subset", let the server
deliver that "subset", and let the client tooling deal with an
incomplete view of the repo.

As I see it there are the following major parts to partial clone:
1. How to let git-clone (and later git-fetch) specify the desired
   subset of objects that it wants? (A ref-relative request.)
2. How to let the server and git-pack-objects build that incomplete
   packfile?
3. How to remember in the local config that a partial clone (or
   fetch) was used and that missing objects should be expected?
4. How to dynamically fetch individual missing objects?
   (Not a ref-relative request.)
5. How to augment the local ODB with partial clone information and
   let git-fsck (and friends) perform limited consistency checking?
6. Methods to bulk fetch missing objects (whether in a pre-verb
   hook or in unpack-tree).
7. Miscellaneous issues (e.g. fixing places that accidentally cause
   a missing object to be fetched that don't really need it).

My proposal [1] includes a generic filtering mechanism that handles 3
types of filtering and makes it easy to add other techniques as we see
fit. It slips in at the list-objects / traverse_commit_list level and
hides all of the details from rev-list and pack-objects. I have a
follow-on proposal [2] that extends the filtering parameter handling to
git-clone, git-fetch, git-fetch-pack, git-upload-pack and the pack
protocol. That takes care of items 1 and 2 above.

Jonathan's proposal [3] includes code to update the local config,
dynamically fetch individual objects, and handle the local ODB and fsck
consistency checking. That takes care of items 3, 4, and 5.

As was suggested above, I think we should merge our efforts: using my
filtering for 1 and 2 and Jonathan's code for 3, 4, and 5. I would need
to eliminate the "relax" options in favor of his is_promised()
functionality for index-pack and similar. And omit his blob-max-bytes
changes from pack-objects, the protocol and related commands.

That should be a good first step.

We both have thoughts on bulk fetching (mine in pre-verb hooks and his
in unpack-tree). We don't need this immediately, but can wait until the
above is working to revisit.

[1] https://github.com/jeffhostetler/git/pull/3
[2] https://github.com/jeffhostetler/git/pull/4
[3] https://github.com/jonathantanmy/git/tree/partialclone3

Thoughts?

Thanks,
Jeff
Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
Christian Couder writes:

> Could you give a bit more details about the use cases this is designed for?
> It seems that when people review my work they want a lot of details
> about the use cases, so I guess they would also be interested in
> getting this kind of information for your work too.
>
> Could this support users who would be interested in lazily cloning
> only one kind of files, for example *.jpeg?

I do not know about others, but the reason why I was not interested in
finding out "use cases" is because the value of this series is use-case
agnostic.

At least to me, the most interesting part of the series is that it
allows you to receive a set of objects transferred from the other side
that lacks some of the objects that would otherwise be required to be
here for connectivity purposes, and it introduces a mechanism to allow
the object transfer layer, gc and fsck to work well together in the
resulting repository that deliberately lacks some objects. The transfer
layer marks the objects obtained from a specific remote as such, and gc
and fsck are taught not to try to follow a missing link, or diagnose a
missing link as an error, if the missing link is expected, using the
mark the transfer layer left.

And it does so in such a way that it is use-case agnostic. The mechanism
does not care how the objects to be omitted were chosen, or how the
omission criteria were negotiated between the sender and the receiver of
the pack.

I think the series comes with one filter that is size-based, but I view
it as a technology demonstration. It does not have to be the primary use
case. IOW, I do not think the series is meant as a declaration that
size-based filtering is the most important thing and other omission
criteria are less important. You should be able to build path-based
omission (i.e. narrow clone) or blob-type-based omission. Depending on
your needs, you may want different object omission criteria. It is
something you can build on top.

And the work done on transfer/gc/fsck in this series does not have to
change to accommodate these different "use cases".
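[Editor's note: to make the point about pluggable, size-based omission criteria concrete, here is a small illustrative shell pipeline. It is not part of either patch series; it merely previews, in any local repository, which blobs a size-based filter such as blob-max-bytes=1048576 would omit. The throwaway repo, file names, and the 1 MiB threshold are all arbitrary choices for the demonstration.]

```shell
set -e
# Build a throwaway repo with one small and one large blob.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.email you@example.com
git -C "$repo" config user.name you
printf 'hello' > "$repo/small.txt"
head -c 2000000 /dev/zero > "$repo/big.bin"
git -C "$repo" add -A
git -C "$repo" commit -qm 'demo commit'

# List blobs above the (arbitrary) 1 MiB threshold: oid and size.
# This is roughly the set a size-based pack filter would leave out.
git -C "$repo" rev-list --objects HEAD |
cut -d' ' -f1 |
git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(objectname)' |
awk '$1 == "blob" && $2 > 1048576 { print $3, $2 }'
```

Only big.bin (2000000 bytes, uncompressed object size) clears the threshold, so a single oid/size line is printed; small.txt is kept.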
Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
On Fri, Sep 29, 2017 at 10:11 PM, Jonathan Tan wrote:
> These patches are also available online:
> https://github.com/jonathantanmy/git/commits/partialclone3
>
> (I've announced it in another e-mail, but am now sending the patches
> to the mailing list too.)
>
> Here's an update of my work so far. Notable features:
> - These 18 patches allow a user to clone with --blob-max-bytes=<size>,
>   creating a partial clone that is automatically configured to lazily
>   fetch missing objects from the origin. The local repo also has fsck
>   working offline, and GC working (albeit only on locally created
>   objects).
> - Cloning and fetching are currently only able to exclude blobs by a
>   size threshold, but the local repository is already capable of
>   fetching missing objects of any type. For example, if a repository
>   with missing trees or commits is generated by any tool (for example,
>   a future version of Git), current Git with my patches will still be
>   able to operate on them, automatically fetching those missing trees
>   and commits when needed.
> - Missing blobs are fetched all at once during checkout.

Could you give a bit more details about the use cases this is designed
for? It seems that when people review my work they want a lot of details
about the use cases, so I guess they would also be interested in getting
this kind of information for your work too.

Could this support users who would be interested in lazily cloning only
one kind of files, for example *.jpeg?
Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
Jonathan Tan writes:

> Jeff Hostetler has sent out some object-filtering patches [1] that are
> a superset of the object-filtering functionality that I have (in the
> pack-objects patches). I have gone for the minimal approach here, but
> if his patches are merged, I'll update my patch set to use those.
>
> [1] https://public-inbox.org/git/20170922203017.53986-6-...@jeffhostetler.com/

Sounds good. Or perhaps rebasing the other way around, if we feel that
the "fsck with known-missing objects" part of your series is in a better
state of done-ness than Jeff's series (which is my impression, but I
have an obvious bias in that I happened to have reviewed your series
with a finer-toothed comb before I saw Jeff's series).

Thanks for working well together ;-).
Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
Hi Jonathan,

On Fri, 29 Sep 2017, Jonathan Tan wrote:

> Jeff Hostetler has sent out some object-filtering patches [1] that are
> a superset of the object-filtering functionality that I have (in the
> pack-objects patches). I have gone for the minimal approach here, but
> if his patches are merged, I'll update my patch set to use those.

I wish there was a way for you to work *with* Jeff on this. It seems
that your aims are similar enough for that (you both need changes in the
protocol), yet different enough to allow for talking past each other
(big blobs vs narrow clone). And I get the impression that in this
instance, it slows everything down to build competing, large patch
series rather than building on top of each other's work.

Additionally, I am not helping by pestering Jeff all the time about
different issues, so it is partially my fault.

But maybe there is a chance to *really* go for a minimal approach, as in
"incremental enough that you can share the first patches"? And even
better: "come up with those first patches together"?

Ciao,
Dscho
[PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
These patches are also available online:
https://github.com/jonathantanmy/git/commits/partialclone3

(I've announced it in another e-mail, but am now sending the patches to
the mailing list too.)

Here's an update of my work so far. Notable features:

- These 18 patches allow a user to clone with --blob-max-bytes=<size>,
  creating a partial clone that is automatically configured to lazily
  fetch missing objects from the origin. The local repo also has fsck
  working offline, and GC working (albeit only on locally created
  objects).
- Cloning and fetching are currently only able to exclude blobs by a
  size threshold, but the local repository is already capable of
  fetching missing objects of any type. For example, if a repository
  with missing trees or commits is generated by any tool (for example, a
  future version of Git), current Git with my patches will still be able
  to operate on them, automatically fetching those missing trees and
  commits when needed.
- Missing blobs are fetched all at once during checkout.

Jeff Hostetler has sent out some object-filtering patches [1] that are a
superset of the object-filtering functionality that I have (in the
pack-objects patches). I have gone for the minimal approach here, but if
his patches are merged, I'll update my patch set to use those.

[1] https://public-inbox.org/git/20170922203017.53986-6-...@jeffhostetler.com/

Demo
====

Obtain a repository.

$ make prefix=$HOME/local install
$ cd $HOME/tmp
$ git clone https://github.com/git/git

Make it advertise the new feature and allow requests for arbitrary
blobs.

$ git -C git config uploadpack.advertiseblobmaxbytes 1
$ git -C git config uploadpack.allowanysha1inwant 1

Perform the partial clone and check that it is indeed smaller. Specify
"file://" in order to test the partial clone mechanism. (If not, Git
will perform a local clone, which unselectively copies every object.)
$ git clone --blob-max-bytes=0 "file://$(pwd)/git" git2
$ git clone "file://$(pwd)/git" git3
$ du -sh git2 git3
85M     git2
130M    git3

Observe that the new repo is automatically configured to fetch missing
objects from the original repo. Subsequent fetches will also be partial.

$ cat git2/.git/config
[core]
        repositoryformatversion = 1
        filemode = true
        bare = false
        logallrefupdates = true
[remote "origin"]
        url = [snip]
        fetch = +refs/heads/*:refs/remotes/origin/*
        blobmaxbytes = 0
[extensions]
        partialclone = origin
[branch "master"]
        remote = origin
        merge = refs/heads/master

Design
======

Local repository layout
-----------------------

A repository declares its dependence on a *promisor remote* (a remote
that declares that it can serve certain objects when requested) by a
repository extension "partialclone". `extensions.partialclone` must be
set to the name of the remote ("origin" in the demo above).

A packfile can be annotated as originating from the promisor remote by
the existence of a "(packfile name).promisor" file with arbitrary
contents (similar to the ".keep" file).

Whenever a promisor remote sends an object, it declares that it can
serve every object directly or indirectly referenced by the sent object.

A promisor packfile is a packfile annotated with the ".promisor" file. A
promisor object is an object that the promisor remote is known to be
able to serve, because it is an object in a promisor packfile or
directly referred to by one. (In the future, we might need to add
".promisor" support to loose objects.)

Connectivity check and gc
-------------------------

The object walk done by the connectivity check (as used by fsck and
fetch) stops at all promisor objects.

The object walk done by gc also stops at all promisor objects. Only
non-promisor packfiles are deleted (if pack deletion is requested);
promisor packfiles are left alone. This maintains the distinction
between promisor packfiles and non-promisor packfiles. (In the future,
we might need to do something more sophisticated with promisor
packfiles.)
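[Editor's note: the ".promisor" sidecar convention described above can be sketched with a small self-contained shell example. This is an illustration of the naming convention only, using a temporary stand-in directory and made-up pack names, not real Git pack data.]

```shell
# A pack is a "promisor" pack if a same-named .promisor file (contents
# arbitrary) sits beside it, similar to the existing ".keep" convention.
pack_dir=$(mktemp -d)                 # stand-in for .git/objects/pack
touch "$pack_dir/pack-aaaa.pack"
touch "$pack_dir/pack-bbbb.pack"
: > "$pack_dir/pack-aaaa.promisor"    # mark only the first pack

for p in "$pack_dir"/pack-*.pack; do
    base=${p%.pack}
    if [ -f "$base.promisor" ]; then
        echo "promisor:     ${p##*/}"
    else
        echo "non-promisor: ${p##*/}"
    fi
done
```

With this layout, a gc honoring the rules above would be free to delete pack-bbbb.pack but would leave pack-aaaa.pack alone, and object walks would stop at objects contained in it.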
Fetching of missing objects
---------------------------

When `sha1_object_info_extended()` (or similar) is invoked, it will
automatically attempt to fetch a missing object from the promisor remote
if that object is not in the local repository. For efficiency, no check
is made as to whether that object is known to be a promisor object or
not. This automatic fetching can be toggled on and off by the
`fetch_if_missing` global variable, and it is on by default.

The actual fetch is done through the fetch-pack/upload-pack protocol.
Right now, this uses the fact that upload-pack allows blob and tree
"want"s, and this incurs the overhead of the unnecessary ref
advertisement. I hope that protocol v2 will allow us to declare that
blob and tree "want"s are allowed, and allow the client to declare that
it does not want the ref advertisement. All packfiles downloaded in this
way