Re: How hard would it be to implement sparse fetching/pulling?
From: "Jeff Hostetler" Sent: Monday, December 04, 2017 3:36 PM On 12/2/2017 11:30 AM, Philip Oakley wrote: From: "Jeff Hostetler" Sent: Friday, December 01, 2017 2:30 PM On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote: I think it would be great if we high level agree on desired user experience, so let me put a few possible use cases here. 1. Init and fetch into a new repo with a sparse list. Preconditions: origin blah exists and has a lot of folders inside of src including "bar". Actions: git init foo && cd foo git config core.sparseAll true # New flag to activate all sparse operations by default so you don't need to pass options to each command. echo "src/bar" > .git/info/sparse-checkout git remote add origin blah git pull origin master Expected results: foo contains src/bar folder and nothing else, objects that are unrelated to this tree are not fetched. Notes: This should work same when fetch/merge/checkout operations are used in the right order. With the current patches (parts 1,2,3) we can pass a blob-ish to the server during a clone that refers to a sparse-checkout specification. I hadn't appreciated this capability. I see it as important, and should be available both ways, so that a .gitNarrow spec can be imposed from the server side, as well as by the requester. It could also be used to assist in the 'precious/secret' blob problem, so that AWS keys are never pushed, nor available for fetching! To be honest, I've always considered partial clone/fetch as a client-side request as a performance feature to minimize download times and disk space requirements on the client. Mine was a two way view where one side or other specified an extent for the narrow clone to achieve either the speed/space improvement or partitioning capability. I've not thought of it from the "server has secrets" point of view. My potential for "secrets" was a little softer that some of the 'hard' security that is often discussed. I'm for the layered risk approach (swiss cheese model) We can talk about it, but I'd like to keep it outside the scope of the current effort. Agreed. My concerns are that that is not the appropriate mechanism to enforce MAC/DAC like security mechanisms. For example: [a] The client will still receive the containing trees that refer to the sensitive blobs, so the user can tell when the secret blobs change -- they wouldn't have either blob, but can tell when they are changed. This event by itself may or may not leak sensitive information depending on the terms of the security policy in place. [b] The existence of such missing blobs would tell the client which blobs are significant and secret and allow them to focus their attack. It would be better if those assets were completely hidden and not in the tree at all. [c] The client could push a fake secret blob to replace the valid one on the server. You would have to audit the server to ensure that it never accepts a push containing a change to any secret blob. And the server would need an infrastructure to know about all secrets in the tree. [d] When a secret blob does change, any local merges by the user lack information to complete the merge -- they can't merge the secrets and they can't be trusted to correctly pick-ours or pick-theirs -- so their workflows are broken. I'm not trying to blindly spread FUD here, but it is arguments like these that make me suggest that the partial clone mechanism is not the right vehicle for such "secret" blobs. I'm on the 'a little security is better than no security' side, but all the points are valid. 
There's a bit of a chicken-n-egg problem getting things set up. So if we assume your team would create a series of "known enlistments" under version control, then you could s/enlistments/entitlements/ I presume? Within my org we speak of "enlistments" as a subset of the tree that you plan to work on. For example, you might enlist in the "file system" portion of the tree or in the "device drivers" portion. If the Makefiles have good partitioning, you should only need one of the above portions to do productive work within a feature area. Ah, so it's the things that have been requested by the client (I'd like to enlist in..) I'm not sure what you mean by "entitlements". It is like having the title deeds to a house - a list of things you have, or can have. (e.g. a father saying: you can have the car on Saturday 6pm-11pm) At the end of the day the particular lists would be the same; they guide what is sent. just reference one by <branch>:<path> during your clone. The server can look up that blob and just use it. git clone --filter=sparse:oid=master:templates/bar URL And then the server will filter out the unwanted blobs during the clone. (The current version only filters blobs; you still get full commits and trees. That will be revisited later.) I'm for the idea that only the in-hierarchy trees should be sent. It shou
Re: How hard would it be to implement sparse fetching/pulling?
Hi, Jeff Hostetler wrote: > On 12/2/2017 1:24 PM, Philip Oakley wrote: >> From: "Jeff Hostetler" >> Sent: Friday, December 01, 2017 5:23 PM >>> Discussing this feature in the context of the defense industry >>> makes me a little nervous. (I used to be in that area.) >> >> I'm viewing the desire for codebase partitioning from a soft layering >> of risk view (perhaps a more UK than USA approach ;-) > > I'm not sure I know what this means or how the UK defense > security models/policy/procedures are different from the US, > so I can't say much here. I'm just thinking that even if we > get a *perfectly working* partial clone/fetch/push/etc. that > it would not pass a security audit. I might be wrong here > (and I'm no expert on the subject), but I think they would > push you towards a different solution architecture. I'm pretty ignorant about the defense industry, but a few more comments: - gitolite implements some features on top of git's server code that I consider to be important for security. So much so that I've been considering what it would take to remove the git-shell command from git.git and move it to the gitolite project where people would be better equipped to use it in an appropriate context - in particular, git's reachability checking code could use some hardening/improvement. In particular, think of edge cases like where someone pushes a pack with deltas referring to objects they should not be able to reach. - Anyone willing to audit git code's security wins my approval. Please, please, audit git code and report the issues you find. :) [...] > Also omitting certain trees means you now (obviously) have both missing > trees and blobs. And both need to be dynamically or batch fetched as > needed. And certain operations will need multiple round trips to fully > resolve -- fault in a tree and then fault in blobs referenced by it. For omitting trees, we will need to modify the index format, since the index has entries for all paths today. That's on the roadmap but has not been implemented yet. Thanks, Jonathan
Re: How hard would it be to implement sparse fetching/pulling?
On 12/2/2017 1:24 PM, Philip Oakley wrote: From: "Jeff Hostetler" Sent: Friday, December 01, 2017 5:23 PM On 11/30/2017 6:43 PM, Philip Oakley wrote: [...] Discussing this feature in the context of the defense industry makes me a little nervous. (I used to be in that area.) I'm viewing the desire for codebase partitioning from a soft layering of risk view (perhaps a more UK than USA approach ;-) I'm not sure I know what this means or how the UK defense security models/policy/procedures are different from the US, so I can't say much here. I'm just thinking that even if we get a *perfectly working* partial clone/fetch/push/etc. that it would not pass a security audit. I might be wrong here (and I'm no expert on the subject), but I think they would push you towards a different solution architecture. What we have in the code so far may be a nice start, but probably doesn't have the assurances that you would need for actual deployment. But it's a start. True. I need to get some of my colleagues more engaged... [...] Yes, this does tend to lead towards an always-online mentality. However, there are 2 parts: [a] dynamic object fetching for missing objects, such as during a random command like diff or blame or merge. We need this regardless of usage -- because we can't always predict (or dry-run) every command the user might run in advance. Making something "useful" happen here when off-line is an obvious goal. [b] batch fetch mode, such as using partial-fetch to match your sparse-checkout so that you always have the blobs of interest to you. And assuming you don't wander outside of this subset of the tree, you should be able to work offline as usual. If you can work within the confines of [b], you wouldn't need to always be online. I feel this is the area that does need to ensure a capability to avoid any perception of the much maligned 'Embrace, extend, and extinguish' by accidental lockout. I don't think this should be viewed as a type of sparse checkout - it's just a checkout of what you have (under the hood it could use the same code though). Right, I'm only thinking of this effort as a way to get a partial clone and fetch that omits unneeded (or, not immediately needed) objects for performance reasons. There are several use scenarios that I've discussed and sparse-checkout is one of them, but I do not consider this to be a sparse-checkout feature. [...] The main problem with markers or other lists of missing objects is that they have scale problems for large repos. Suppose I have 100M blobs in my repo. If I do a blob:none clone, I'd have 100M missing blobs that would need tracking. If I then do a batch fetch of the blobs needed to do a sparse checkout of HEAD, I'd have to remove those entries from the tracking data. Not impossible, but not speedy either. ** Ahhh. I see. That's a consequence of having all the trees, isn't it. ** I've always thought that limiting the trees is at the heart of the Narrow clone/fetch problem. OK, so if you have flat, wide structures with 10k files/directories per tree then it's still a fair-sized problem, but it should *scale logarithmically* for the part of the tree structure that's not being downloaded. You never have to add a marker for a blob that you have no containing tree for. Nor for the tree that contained the blob's tree, all the way up the primary line of descent to the tree of concern. All those trees are never downloaded, there are few markers (.gitNarrowTree files) for those tree stubs, certainly no 100M missing blob markers.
Currently, the code only omits blobs. I want to extend the current code to have filters that also exclude unneeded trees. That will help address some of these size concerns, but there are still perf issues here. * Marking of 'missing' objects in the local object store, and on the wire. The missing objects are replaced by a placeholder object, which uses the same oid/sha1, but has a short fixed length, with content "GitNarrowObject <oid>". The chance that such a string would actually collide with a real oid is the same as for any other object hash, so it is a *safe* self-referential device. Again, there is a scale problem here. If I have 100M missing blobs, I can't afford to create 100M loose placeholder files. Or juggle a 2GB file of missing objects on various operations. As above, I'm also trimming the trees, so in general, there would be no missing blobs, just the content of the directory one was interested in. That's not quite true if higher-level trees have blob references in them that are otherwise unwanted - they may each need a marker. [Or maybe a special single 'tree-of-blobs' marker for them all, thus only one marker per tree - over-thinking maybe...] Also omitting certain trees means you now (obviously) have both missing trees and blobs. And both need to be dynamically or batch fetched as needed. And certain operations will need multiple round
Re: How hard would it be to implement sparse fetching/pulling?
On 12/1/2017 1:24 PM, Jonathan Nieder wrote: Jeff Hostetler wrote: On 11/30/2017 6:43 PM, Philip Oakley wrote: The 'companies' problem is that it tends to force a client-server, always-on on-line mentality. I'm also wanting the original DVCS off-line capability to still be available, with _user_ control, in a generic sense, of what they have locally available (including files/directories they have not yet looked at, but expect to have. IIUC Jeff's work is that on-line view, without the off-line capability. I'd commented early in the series at [1,2,3]. Yes, this does tend to lead towards an always-online mentality. However, there are 2 parts: [a] dynamic object fetching for missing objects, such as during a random command like diff or blame or merge. We need this regardless of usage -- because we can't always predict (or dry-run) every command the user might run in advance. [b] batch fetch mode, such as using partial-fetch to match your sparse-checkout so that you always have the blobs of interest to you. And assuming you don't wander outside of this subset of the tree, you should be able to work offline as usual. If you can work within the confines of [b], you wouldn't need to always be online. Just to amplify this: for our internal use we care a lot about disconnected usage working. So it is not like we have forgotten about this use case. We might also add a part [c] with explicit commands to back-fill or alter your incomplete view of the ODB Agreed, this will be a nice thing to add. [...] At its core, my idea was to use the object store to hold markers for the 'not yet fetched' objects (mainly trees and blobs). These would be in a known fixed format, and have the same effect (conceptually) as the sub-module markers - they _confirm_ the oid, yet say 'not here, try elsewhere'. We do have something like this. Jonathan can explain better than I, but basically, we denote possibly incomplete packfiles from partial clones and fetches as "promisor" and have special rules in the code to assert that a missing blob referenced from a "promisor" packfile is OK and can be fetched later if necessary from the "promising" remote. The main problem with markers or other lists of missing objects is that it has scale problems for large repos. Any chance that we can get a design doc in Documentation/technical/ giving an overview of the design, with a brief "alternatives considered" section describing this kind of thing? Yeah, I'll start one. I have notes within the individual protocol docs and man-pages, but no summary doc. Thanks! E.g. some of the earlier descriptions like https://public-inbox.org/git/20170915134343.3814d...@twelve2.svl.corp.google.com/ https://public-inbox.org/git/cover.1506714999.git.jonathanta...@google.com/ https://public-inbox.org/git/20170113155253.1644-1-benpe...@microsoft.com/ may help as a starting point. Thanks, Jonathan
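For example, workflow [b] above might look roughly like the following. This is only a sketch using the --filter syntax from the patch series; the exact flag and config names may differ in the final code, and the URL is a placeholder.

    # partial clone that omits all blobs; packs fetched this way are marked
    # as "promisor" packs, so the objects they promise may be absent locally
    git clone --filter=blob:none --no-checkout <url> repo
    cd repo
    # limit the worktree to the area of interest
    git config core.sparseCheckout true
    echo "src/bar/" > .git/info/sparse-checkout
    # checkout then batch-fetches just the blobs needed under src/bar/
    git checkout master

As long as subsequent work stays inside src/bar/, no further online access should be needed.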
Re: How hard would it be to implement sparse fetching/pulling?
On 12/2/2017 11:30 AM, Philip Oakley wrote: From: "Jeff Hostetler" Sent: Friday, December 01, 2017 2:30 PM On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote: I think it would be great if we high level agree on desired user experience, so let me put a few possible use cases here. 1. Init and fetch into a new repo with a sparse list. Preconditions: origin blah exists and has a lot of folders inside of src including "bar". Actions: git init foo && cd foo git config core.sparseAll true # New flag to activate all sparse operations by default so you don't need to pass options to each command. echo "src/bar" > .git/info/sparse-checkout git remote add origin blah git pull origin master Expected results: foo contains src/bar folder and nothing else, objects that are unrelated to this tree are not fetched. Notes: This should work same when fetch/merge/checkout operations are used in the right order. With the current patches (parts 1,2,3) we can pass a blob-ish to the server during a clone that refers to a sparse-checkout specification. I hadn't appreciated this capability. I see it as important, and should be available both ways, so that a .gitNarrow spec can be imposed from the server side, as well as by the requester. It could also be used to assist in the 'precious/secret' blob problem, so that AWS keys are never pushed, nor available for fetching! To be honest, I've always considered partial clone/fetch as a client-side request as a performance feature to minimize download times and disk space requirements on the client. I've not thought of it from the "server has secrets" point of view. We can talk about it, but I'd like to keep it outside the scope of the current effort. My concerns are that that is not the appropriate mechanism to enforce MAC/DAC like security mechanisms. For example: [a] The client will still receive the containing trees that refer to the sensitive blobs, so the user can tell when the secret blobs change -- they wouldn't have either blob, but can tell when they are changed. This event by itself may or may not leak sensitive information depending on the terms of the security policy in place. [b] The existence of such missing blobs would tell the client which blobs are significant and secret and allow them to focus their attack. It would be better if those assets were completely hidden and not in the tree at all. [c] The client could push a fake secret blob to replace the valid one on the server. You would have to audit the server to ensure that it never accepts a push containing a change to any secret blob. And the server would need an infrastructure to know about all secrets in the tree. [d] When a secret blob does change, any local merges by the user lack information to complete the merge -- they can't merge the secrets and they can't be trusted to correctly pick-ours or pick-theirs -- so their workflows are broken. I'm not trying to blindly spread FUD here, but it is arguments like these that make me suggest that the partial clone mechanism is not the right vehicle for such "secret" blobs. There's a bit of a chicken-n-egg problem getting things set up. So if we assume your team would create a series of "known enlistments" under version control, then you could s/enlistments/entitlements/ I presume? Within my org we speak of "enlistments" as subset of the tree that you plan to work on. For example, you might enlist in the "file system" portion of the tree or in the "device drivers" portion. 
If the Makefiles have good partitioning, you should only need one of the above portions to do productive work within a feature area. I'm not sure what you mean by "entitlements". just reference one by <branch>:<path> during your clone. The server can look up that blob and just use it. git clone --filter=sparse:oid=master:templates/bar URL And then the server will filter out the unwanted blobs during the clone. (The current version only filters blobs; you still get full commits and trees. That will be revisited later.) I'm for the idea that only the in-hierarchy trees should be sent. It should also be possible that the server replies that it is only sending a narrow clone, with the given (accessible?) spec. I do want to extend this to have unneeded tree filtering too. It is just not in this version. On the client side, the partial clone installs local config settings into the repo so that subsequent fetches default to the same filter criteria as used in the clone. I don't currently have provision to send a full sparse-checkout specification to the server during a clone or fetch. That seemed like too much to try to squeeze into the protocols. We can revisit this later if there is interest, but it wasn't critical for the initial phase. Agreed. I think it should be somewhere 'visible' to the user, but could be set up by the server admin / repo maintainer if they don't have write access. But there cou
Re: How hard would it be to implement sparse fetching/pulling?
From: "Jeff Hostetler" Sent: Friday, December 01, 2017 5:23 PM On 11/30/2017 6:43 PM, Philip Oakley wrote: From: "Vitaly Arbuzov" [...] comments below.. On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov wrote: Hey Jeff, It's great, I didn't expect that anyone is actively working on this. I'll check out your branch, meanwhile do you have any design docs that describe these changes or can you define high level goals that you want to achieve? On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler wrote: On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote: [...] I have, for separate reasons been _thinking_ about the issue ($dayjob is in defence, so a similar partition would be useful). The changes would almost certainly need to be server side (as well as client side), as it is the server that decides what is sent over the wire in the pack files, which would need to be a 'narrow' pack file. Yes, there will need to be both client and server changes. In the current 3 part patch series, the client sends a "filter_spec" to the server as part of the fetch-pack/upload-pack protocol. If the server chooses to honor it, upload-pack passes the filter_spec to pack-objects to build an "incomplete" packfile omitting various objects (currently blobs). Proprietary servers will need similar changes to support this feature. Discussing this feature in the context of the defense industry makes me a little nervous. (I used to be in that area.) I'm viewing the desire for codebase partitioning from a soft layering of risk view (perhaps a more UK than USA approach ;-) What we have in the code so far may be a nice start, but probably doesn't have the assurances that you would need for actual deployment. But it's a start True. I need to get some of my collegues more engaged... If we had such a feature then all we would need on top is a separate tool that builds the right "sparse" scope for the workspace based on paths that developer wants to work on. In the world where more and more companies are moving towards large monorepos this improvement would provide a good way of scaling git to meet this demand. The 'companies' problem is that it tends to force a client-server, always-on on-line mentality. I'm also wanting the original DVCS off-line capability to still be available, with _user_ control, in a generic sense, of what they have locally available (including files/directories they have not yet looked at, but expect to have. IIUC Jeff's work is that on-line view, without the off-line capability. I'd commented early in the series at [1,2,3]. Yes, this does tend to lead towards an always-online mentality. However, there are 2 parts: [a] dynamic object fetching for missing objects, such as during a random command like diff or blame or merge. We need this regardless of usage -- because we can't always predict (or dry-run) every command the user might run in advance. Making something "useful" happen here when off-line is an obvious goal. [b] batch fetch mode, such as using partial-fetch to match your sparse-checkout so that you always have the blobs of interest to you. And assuming you don't wander outside of this subset of the tree, you should be able to work offline as usual. If you can work within the confines of [b], you wouldn't need to always be online. I feel this is the area that does need ensure a capability to avoid any perception of the much maligned 'Embrace, extend, and extinguish' by accidental lockout. 
I don't think this should be viewed as a type of sparse checkout - it's just a checkout of what you have (under the hood it could use the same code though). We might also add a part [c] with explicit commands to back-fill or alter your incomplete view of the ODB (as I explained in response to the "git diff <commit>" comment later in this thread). At its core, my idea was to use the object store to hold markers for the 'not yet fetched' objects (mainly trees and blobs). These would be in a known fixed format, and have the same effect (conceptually) as the sub-module markers - they _confirm_ the oid, yet say 'not here, try elsewhere'. We do have something like this. Jonathan can explain better than I, but basically, we denote possibly incomplete packfiles from partial clones and fetches as "promisor" and have special rules in the code to assert that a missing blob referenced from a "promisor" packfile is OK and can be fetched later if necessary from the "promising" remote. The remote interaction is one area that may need thought, especially in a triangle workflow, of which there are a few. The main problem with markers or other lists of missing objects is that they have scale problems for large repos. Suppose I have 100M blobs in my repo. If I do a blob:none clone, I'd have 100M missing blobs that would need tracking. If I then do a batch fetch of the blobs needed to do a sparse checkout of HEAD, I'd have to remove those entries from
Re: How hard would it be to implement sparse fetching/pulling?
Hi Jonathan, Thanks for the outline. It has help clarify some points and see the very similar alignments. The one thing I wasn't clear about is the "promised" objects/remote. Is that "promisor" remote a fixed entity, or could it be one of many remotes that could be a "provider"? (sort of like fetching sub-modules...) Philip From: "Jonathan Nieder" Sent: Friday, December 01, 2017 2:51 AM Hi Vitaly, Vitaly Arbuzov wrote: I think it would be great if we high level agree on desired user experience, so let me put a few possible use cases here. I think one thing this thread is pointing to is a lack of overview documentation about how the 'partial clone' series currently works. The basic components are: 1. extending git protocol to (1) allow fetching only a subset of the objects reachable from the commits being fetched and (2) later, going back and fetching the objects that were left out. We've also discussed some other protocol changes, e.g. to allow obtaining the sizes of un-fetched objects without fetching the objects themselves 2. extending git's on-disk format to allow having some objects not be present but only be "promised" to be obtainable from a remote repository. When running a command that requires those objects, the user can choose to have it either (a) error out ("airplane mode") or (b) fetch the required objects. It is still possible to work fully locally in such a repo, make changes, get useful results out of "git fsck", etc. It is kind of similar to the existing "shallow clone" feature, except that there is a more straightforward way to obtain objects that are outside the "shallow" clone when needed on demand. 3. improving everyday commands to require fewer objects. For example, if I run "git log -p", then I way to see the history of most files but I don't necessarily want to download large binary files just to print 'Binary files differ' for them. And by the same token, we might want to have a mode for commands like "git log -p" to default to restricting to a particular directory, instead of downloading files outside that directory. There are some fundamental changes to make in this category --- e.g. modifying the index format to not require entries for files outside the sparse checkout, to avoid having to download the trees for them. The overall goal is to make git scale better. The existing patches do (1) and (2), though it is possible to do more in those categories. :) We have plans to work on (3) as well. These are overall changes that happen at a fairly low level in git. They mostly don't require changes command-by-command. Thanks, Jonathan
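To make components (1) and (2) concrete, here is a hedged sketch of how such a repo behaves, assuming a blob:none filter as in the series; the host and file path are purely illustrative.

    git clone --filter=blob:none ssh://host/project.git
    cd project
    git log --oneline -3          # works locally: commits and trees are present
    git show HEAD:docs/notes.txt  # this blob was omitted at clone time, so it
                                  # is fetched on demand from the promising remote

In "airplane mode" the second command would instead error out rather than attempt the fetch.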
Re: How hard would it be to implement sparse fetching/pulling?
From: "Jeff Hostetler" Sent: Friday, December 01, 2017 2:30 PM On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote: I think it would be great if we high level agree on desired user experience, so let me put a few possible use cases here. 1. Init and fetch into a new repo with a sparse list. Preconditions: origin blah exists and has a lot of folders inside of src including "bar". Actions: git init foo && cd foo git config core.sparseAll true # New flag to activate all sparse operations by default so you don't need to pass options to each command. echo "src/bar" > .git/info/sparse-checkout git remote add origin blah git pull origin master Expected results: foo contains src/bar folder and nothing else, objects that are unrelated to this tree are not fetched. Notes: This should work same when fetch/merge/checkout operations are used in the right order. With the current patches (parts 1,2,3) we can pass a blob-ish to the server during a clone that refers to a sparse-checkout specification. I hadn't appreciated this capability. I see it as important, and should be available both ways, so that a .gitNarrow spec can be imposed from the server side, as well as by the requester. It could also be used to assist in the 'precious/secret' blob problem, so that AWS keys are never pushed, nor available for fetching! There's a bit of a chicken-n-egg problem getting things set up. So if we assume your team would create a series of "known enlistments" under version control, then you could s/enlistments/entitlements/ I presume? just reference one by : during your clone. The server can lookup that blob and just use it. git clone --filter=sparse:oid=master:templates/bar URL And then the server will filter-out the unwanted blobs during the clone. (The current version only filters blobs; you still get full commits and trees. That will be revisited later.) I'm for the idea that only the in-heirachy trees should be sent. It should also be possible that the server replies that it is only sending a narrow clone, with the given (accessible?) spec. On the client side, the partial clone installs local config settings into the repo so that subsequent fetches default to the same filter criteria as used in the clone. I don't currently have provision to send a full sparse-checkout specification to the server during a clone or fetch. That seemed like too much to try to squeeze into the protocols. We can revisit this later if there is interest, but it wasn't critical for the initial phase. Agreed. I think it should be somewhere 'visible' to the user, but could be setup by the server admin / repo maintainer if they don't have write access. But there could still be the catch-22 - maybe one starts with a toptree> : pair to define an origin point (it's not as refined as a .gitNarrow spec file, but is definative). The toptree option could even allow sub-tree clones.. maybe.. 2. Add a file and push changes. Preconditions: all steps above followed. touch src/bar/baz.txt && git add -A && git commit -m "added a file" git push origin master Expected results: changes are pushed to remote. I don't believe partial clone and/or partial fetch will cause any changes for push. I suspect that pushes could be rejected if the user 'pretends' to modify files or trees outside their area. It does need the user to be able to spoof part of a tree they don't have, so an upstream / remote would immediatly know it was a spoof but locally the narrow clone doesn't have enough detail about the 'bad' oid. It would be right to reject such attempts! 3. 
Clone a repo with a sparse list as a filter. Preconditions: same as for #1 Actions: echo "src/bar" > /tmp/blah-sparse-checkout git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be the only command that would require a specific option key being passed. Expected results: same as for #1 plus /tmp/blah-sparse-checkout is copied into .git/info/sparse-checkout I presume clone and fetch are treated equivalently here. There are 2 independent concepts here: clone and checkout. Currently, there isn't any automatic linkage of the partial clone to the sparse-checkout settings, so you could do something like this: I see an implicit link in that clearly one cannot check out (inflate/populate) a file/directory that one does not have in the object store. But that does not imply the reverse linkage. The regular sparse checkout should be available independently of the local clone being a narrow one. git clone --no-checkout --filter=sparse:oid=master:templates/bar URL git cat-file ... templates/bar >.git/info/sparse-checkout git config core.sparsecheckout true git checkout ... I've been focused on the clone/fetch issues and have not looked into the automation to couple them. I foresee that large files and certain files need to be filterable for fetch-clone, and that might not be (backward) compatible with the sparse-checkout. 4. Sho
Re: How hard would it be to implement sparse fetching/pulling?
From: "Vitaly Arbuzov" Sent: Friday, December 01, 2017 1:27 AM Jonathan, thanks for references, that is super helpful, I will follow your suggestions. Philip, I agree that keeping original DVCS off-line capability is an important point. Ideally this feature should work even with remotes that are located on the local disk. And with other any other remote. (even to the extent that the other remote may indicate it has no capability, sorry, go away..) E.g. One ought to be able to have/create a Github narrow fork of only the git.git/Documenation repo, and interact with that. (how much nicer if it was git.git/Documenation/ManPages/ to ease the exclusion of RelNotes/, howto/ and technical/ ) Which part of Jeff's work do you think wouldn't work offline after repo initialization is done and sparse fetch is performed? All the stuff that I've seen seems to be quite usable without GVFS. I think it's that initial download that may be different, and what is expected of it. In my case, one may never connect to that server again, yet still be able to work both off-line and with other remotes (push and pull as per capabilities). Below I note that I'd only fetch the needed trees, not all of them. Also one needs to fetch a complete (pre-defined) subset, rather than an on-demand subset. I'm not sure if we need to store markers/tombstones on the client, what problem does it solve? The part that the markers hopes to solve is the part that I hadn't said, that they should also show in the work tree so that users can see what is missing and where. Importantly I would also trim the directory (tree) structure so only the direct heirachy of those files the user sees are visible, though at each level they would see side directory names (which are embedded in the heirachical tree objects). (IIUC Jeff H's scheme downloads *all* trees, not just a few) It would mean that users can create a complete fresh tree and commit that can be merged and picked onto the usptream tree from the _directory worktree alone_, because the oid's of all the parts are listed in the worktree. The actual objects for the missing oids being available in the appropriate upstream. It also means the index can be deleted, and with only the local narrow pack files and the current worktree the index can be recreated at the current sparseness level. (I'm hoping I've understood the dispersement of data between index and narrow packs corrrectly here ;-) -- Philip On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley wrote: From: "Vitaly Arbuzov" Found some details here: https://github.com/jeffhostetler/git/pull/3 Looking at commits I see that you've done a lot of work already, including packing, filtering, fetching, cloning etc. What are some areas that aren't complete yet? Do you need any help with implementation? comments below.. On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov wrote: Hey Jeff, It's great, I didn't expect that anyone is actively working on this. I'll check out your branch, meanwhile do you have any design docs that describe these changes or can you define high level goals that you want to achieve? On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler wrote: On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote: Hi guys, I'm looking for ways to improve fetch/pull/clone time for large git (mono)repositories with unrelated source trees (that span across multiple services). I've found sparse checkout approach appealing and helpful for most of client-side operations (e.g. status, reset, commit, etc.) 
The problem is that there is no feature like sparse fetch/pull in git, this means that ALL objects in unrelated trees are always fetched. It may take a lot of time for large repositories and results in some practical scalability limits for git. This forced some large companies like Facebook and Google to move to Mercurial as they were unable to improve client-side experience with git while Microsoft has developed GVFS, which seems to be a step back to CVCS world. I want to get a feedback (from more experienced git users than I am) on what it would take to implement sparse fetching/pulling. (Downloading only objects related to the sparse-checkout list) Are there any issues with missing hashes? Are there any fundamental problems why it can't be done? Can we get away with only client-side changes or would it require special features on the server side? I have, for separate reasons been _thinking_ about the issue ($dayjob is in defence, so a similar partition would be useful). The changes would almost certainly need to be server side (as well as client side), as it is the server that decides what is sent over the wire in the pack files, which would need to be a 'narrow' pack file. If we had such a feature then all we would need on top is a separate tool that builds the right "sparse" scope for the workspace based on paths that developer wants to work on. In the world where more and more companies are mov
Re: How hard would it be to implement sparse fetching/pulling?
Jeff Hostetler wrote: > On 11/30/2017 6:43 PM, Philip Oakley wrote: >> The 'companies' problem is that it tends to force a client-server, always-on >> on-line mentality. I'm also wanting the original DVCS off-line capability to >> still be available, with _user_ control, in a generic sense, of what they >> have locally available (including files/directories they have not yet looked >> at, but expect to have. IIUC Jeff's work is that on-line view, without the >> off-line capability. >> >> I'd commented early in the series at [1,2,3]. > > Yes, this does tend to lead towards an always-online mentality. > However, there are 2 parts: > [a] dynamic object fetching for missing objects, such as during a > random command like diff or blame or merge. We need this > regardless of usage -- because we can't always predict (or > dry-run) every command the user might run in advance. > [b] batch fetch mode, such as using partial-fetch to match your > sparse-checkout so that you always have the blobs of interest > to you. And assuming you don't wander outside of this subset > of the tree, you should be able to work offline as usual. > If you can work within the confines of [b], you wouldn't need to > always be online. Just to amplify this: for our internal use we care a lot about disconnected usage working. So it is not like we have forgotten about this use case. > We might also add a part [c] with explicit commands to back-fill or > alter your incomplete view of the ODB Agreed, this will be a nice thing to add. [...] >> At its core, my idea was to use the object store to hold markers for the >> 'not yet fetched' objects (mainly trees and blobs). These would be in a >> known fixed format, and have the same effect (conceptually) as the >> sub-module markers - they _confirm_ the oid, yet say 'not here, try >> elsewhere'. > > We do have something like this. Jonathan can explain better than I, but > basically, we denote possibly incomplete packfiles from partial clones > and fetches as "promisor" and have special rules in the code to assert > that a missing blob referenced from a "promisor" packfile is OK and can > be fetched later if necessary from the "promising" remote. > > The main problem with markers or other lists of missing objects is > that it has scale problems for large repos. Any chance that we can get a design doc in Documentation/technical/ giving an overview of the design, with a brief "alternatives considered" section describing this kind of thing? E.g. some of the earlier descriptions like https://public-inbox.org/git/20170915134343.3814d...@twelve2.svl.corp.google.com/ https://public-inbox.org/git/cover.1506714999.git.jonathanta...@google.com/ https://public-inbox.org/git/20170113155253.1644-1-benpe...@microsoft.com/ may help as a starting point. Thanks, Jonathan
Re: How hard would it be to implement sparse fetching/pulling?
Hi, Jeff Hostetler wrote: > On 11/30/2017 3:03 PM, Jonathan Nieder wrote: >> One piece of missing functionality that looks interesting to me: that >> series batches fetches of the missing blobs involved in a "git >> checkout" command: >> >> https://public-inbox.org/git/20171121211528.21891-14-...@jeffhostetler.com/ >> >> But it doesn't batch fetches of the missing blobs involved in a "git >> diff <commit>" command. That might be a good place to get >> your hands dirty. :) > Jonathan Tan added code in unpack-trees to bulk fetch missing blobs > before a checkout. This is limited to the missing blobs needed for > the target commit. We need this to make checkout seamless, but it > does mean that checkout may need online access. Just to clarify: other parts of the series already fetch all missing blobs that are required for a command. What that bulk-fetch patch does is to make that more efficient, by using a single fetch request to grab all the blobs that are needed for checkout, instead of one fetch per blob. This doesn't change the online access requirement: online access is needed if and only if you don't have the required objects already available locally. For example, if at clone time you specified a sparse checkout pattern and you haven't changed that sparse checkout pattern, then online access is not needed for checkout. > I've also talked about a pre-fetch capability to bulk fetch missing > blobs in advance of some operation. You could speed up the above > diff command or back-fill all the blobs I might need before going > offline for a while. In particular, something like this seems like a very valuable thing to have when changing the sparse checkout pattern. Thanks, Jonathan
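One way to check whether anything is still missing before going offline is to use the rev-list options mentioned elsewhere in the thread. A hedged sketch, with the '?'-prefixed output format assumed from the series:

    # list objects reachable from HEAD that are not present locally;
    # with --missing=print each missing oid is printed with a leading '?'
    git rev-list --objects --missing=print HEAD | grep '^?'

An empty result would suggest the current checkout can be completed entirely offline.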
Re: How hard would it be to implement sparse fetching/pulling?
On 11/30/2017 6:43 PM, Philip Oakley wrote: From: "Vitaly Arbuzov" [...] comments below.. On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov wrote: Hey Jeff, It's great, I didn't expect that anyone is actively working on this. I'll check out your branch, meanwhile do you have any design docs that describe these changes or can you define high level goals that you want to achieve? On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler wrote: On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote: [...] I have, for separate reasons been _thinking_ about the issue ($dayjob is in defence, so a similar partition would be useful). The changes would almost certainly need to be server side (as well as client side), as it is the server that decides what is sent over the wire in the pack files, which would need to be a 'narrow' pack file. Yes, there will need to be both client and server changes. In the current 3 part patch series, the client sends a "filter_spec" to the server as part of the fetch-pack/upload-pack protocol. If the server chooses to honor it, upload-pack passes the filter_spec to pack-objects to build an "incomplete" packfile omitting various objects (currently blobs). Proprietary servers will need similar changes to support this feature. Discussing this feature in the context of the defense industry makes me a little nervous. (I used to be in that area.) What we have in the code so far may be a nice start, but probably doesn't have the assurances that you would need for actual deployment. But it's a start If we had such a feature then all we would need on top is a separate tool that builds the right "sparse" scope for the workspace based on paths that developer wants to work on. In the world where more and more companies are moving towards large monorepos this improvement would provide a good way of scaling git to meet this demand. The 'companies' problem is that it tends to force a client-server, always-on on-line mentality. I'm also wanting the original DVCS off-line capability to still be available, with _user_ control, in a generic sense, of what they have locally available (including files/directories they have not yet looked at, but expect to have. IIUC Jeff's work is that on-line view, without the off-line capability. I'd commented early in the series at [1,2,3]. Yes, this does tend to lead towards an always-online mentality. However, there are 2 parts: [a] dynamic object fetching for missing objects, such as during a random command like diff or blame or merge. We need this regardless of usage -- because we can't always predict (or dry-run) every command the user might run in advance. [b] batch fetch mode, such as using partial-fetch to match your sparse-checkout so that you always have the blobs of interest to you. And assuming you don't wander outside of this subset of the tree, you should be able to work offline as usual. If you can work within the confines of [b], you wouldn't need to always be online. We might also add a part [c] with explicit commands to back-fill or alter your incomplete view of the ODB (as I explained in response to the "git diff " comment later in this thread. At its core, my idea was to use the object store to hold markers for the 'not yet fetched' objects (mainly trees and blobs). These would be in a known fixed format, and have the same effect (conceptually) as the sub-module markers - they _confirm_ the oid, yet say 'not here, try elsewhere'. We do have something like this. 
Jonathan can explain better than I, but basically, we denote possibly incomplete packfiles from partial clones and fetches as "promisor" and have special rules in the code to assert that a missing blob referenced from a "promisor" packfile is OK and can be fetched later if necessary from the "promising" remote. The main problem with markers or other lists of missing objects is that they have scale problems for large repos. Suppose I have 100M blobs in my repo. If I do a blob:none clone, I'd have 100M missing blobs that would need tracking. If I then do a batch fetch of the blobs needed to do a sparse checkout of HEAD, I'd have to remove those entries from the tracking data. Not impossible, but not speedy either. The comparison with submodules means there is the same chance of de-synchronisation with triangular and upstream servers, unless managed. The server side, as noted, will need to be included as it is the one that decides the pack file. Options for server management are:
- "I accept narrow packs?" No; yes.
- "I serve narrow packs?" No; yes.
- "Repo completeness checks on receipt": (must be complete) || (allow narrow to nothing).
We have new config settings for the server to allow/reject partial clones. And we have code in fsck/gc to handle these incomplete packfiles. For server farms (e.g. Github..) the settings could be global, or by repo. (note that the completeness requirement and narrow receipt option
Re: How hard would it be to implement sparse fetching/pulling?
On 11/30/2017 3:03 PM, Jonathan Nieder wrote: Hi Vitaly, Vitaly Arbuzov wrote: Found some details here: https://github.com/jeffhostetler/git/pull/3 Looking at commits I see that you've done a lot of work already, including packing, filtering, fetching, cloning etc. What are some areas that aren't complete yet? Do you need any help with implementation? That's a great question! I've filed https://crbug.com/git/2 to track this project. Feel free to star it to get updates there, or to add updates of your own. Thanks! As described at https://crbug.com/git/2#c1, currently there are three patch series for which review would be very welcome. Building on top of them is welcome as well. Please make sure to coordinate with jeffh...@microsoft.com and jonathanta...@google.com (e.g. through the bug tracker or email). One piece of missing functionality that looks interesting to me: that series batches fetches of the missing blobs involved in a "git checkout" command: https://public-inbox.org/git/20171121211528.21891-14-...@jeffhostetler.com/ But it doesn't batch fetches of the missing blobs involved in a "git diff <commit>" command. That might be a good place to get your hands dirty. :) Jonathan Tan added code in unpack-trees to bulk fetch missing blobs before a checkout. This is limited to the missing blobs needed for the target commit. We need this to make checkout seamless, but it does mean that checkout may need online access. I've also talked about a pre-fetch capability to bulk fetch missing blobs in advance of some operation. You could speed up the above diff command or back-fill all the blobs I might need before going offline for a while. You can use the options that were added to rev-list to help with this. For example:
git rev-list --objects [--filter=<filter-spec>] --missing=print <commit>
git rev-list --objects [--filter=<filter-spec>] --missing=print <commit1>..<commit2>
And then pipe that into a "git fetch-pack --stdin". You might experiment with this. Thanks, Jonathan Thanks, Jeff
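For instance, the two commands above might be wired together roughly like this. This is only a sketch: the remote must permit requesting arbitrary object ids, and the exact fetch-pack invocation may need adjusting.

    # collect the oids of missing objects (printed with a '?' prefix),
    # then ask the remote for them in one batch
    git rev-list --objects --filter=blob:none --missing=print HEAD |
        sed -n 's/^?//p' |
        git fetch-pack --stdin "$(git remote get-url origin)"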
Re: How hard would it be to implement sparse fetching/pulling?
On 11/30/2017 12:44 PM, Vitaly Arbuzov wrote: Found some details here: https://github.com/jeffhostetler/git/pull/3 Looking at commits I see that you've done a lot of work already, including packing, filtering, fetching, cloning etc. What are some areas that aren't complete yet? Do you need any help with implementation? Sure. Extra hands are always welcome. Jonathan Tan and I have been working on this together. Our V5 is on the mailing list now. We have privately exchanged some commits that I hope to push up as a V6 today or Monday. As for how to help, I'll have to think about that a bit. Without knowing your experience level in the code or your availability, it is hard to pick something specific right now. As a first step, build my core/pc5_p3 branch and try using partial clone/fetch between local repos. You can look at the tests we added (t0410, t5317, t5616, t6112) to see sample setups using a local pair of repos. Then try actually using the partial clone repo for actual work (dogfooding) and see how it falls short of your expectations. You might try duplicating the above tests to use a local "git daemon" serving the remote and do partial clones using localhost URLs rather than file:// URLs. That would exercise the transport differently. The t5616 test has the start of some end-to-end tests that try to combine multiple steps -- such as doing a partial clone with no blobs and then running blame on a file. You could extend that with more tests that test odd combinations of commands and confirm that we can handle missing blobs in different scenarios. Since you've expressed an interest in sparse-checkout and having a complete end-to-end experience, you might also experiment with adapting the above tests to use the sparse filter (--filter=sparse:oid=<blob-ish>) instead of blob:none or blob:limit. See where that takes you and add tests as you see fit. The goal being to get tests in place that match the usage you want to see (even if they fail) and see what that looks like. I know it is not as glamorous as adding new functionality, but it would help get you up to speed on the code and we do need additional tests. Thanks Jeff
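A minimal local setup along those lines might look like the sketch below. The uploadpack.allowFilter knob and the filter syntax are the ones used in the series and could still change; paths are illustrative.

    # create a small "server" repo and allow it to honor filter requests
    git init /tmp/src && cd /tmp/src
    echo hello > file.txt && git add file.txt && git commit -m "init"
    git config uploadpack.allowFilter true
    # partial clone over the file:// transport
    git clone --filter=blob:none "file:///tmp/src" /tmp/pc-file
    # the same thing over git://localhost, exercising the daemon transport
    git daemon --base-path=/tmp --export-all --detach
    git clone --filter=blob:none git://localhost/src /tmp/pc-daemon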
Re: How hard would it be to implement sparse fetching/pulling?
On 11/30/2017 12:01 PM, Vitaly Arbuzov wrote: Hey Jeff, It's great, I didn't expect that anyone is actively working on this. I'll check out your branch, meanwhile do you have any design docs that describe these changes or can you define high level goals that you want to achieve? There are no summary docs in a traditional sense. The patch series does have updated docs which show the changes to some of the commands and protocols. I would start there. Jeff
Re: How hard would it be to implement sparse fetching/pulling?
On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote: I think it would be great if we high level agree on desired user experience, so let me put a few possible use cases here. 1. Init and fetch into a new repo with a sparse list. Preconditions: origin blah exists and has a lot of folders inside of src including "bar". Actions: git init foo && cd foo git config core.sparseAll true # New flag to activate all sparse operations by default so you don't need to pass options to each command. echo "src/bar" > .git/info/sparse-checkout git remote add origin blah git pull origin master Expected results: foo contains src/bar folder and nothing else, objects that are unrelated to this tree are not fetched. Notes: This should work same when fetch/merge/checkout operations are used in the right order. With the current patches (parts 1,2,3) we can pass a blob-ish to the server during a clone that refers to a sparse-checkout specification. There's a bit of a chicken-n-egg problem getting things set up. So if we assume your team would create a series of "known enlistments" under version control, then you could just reference one by : during your clone. The server can lookup that blob and just use it. git clone --filter=sparse:oid=master:templates/bar URL And then the server will filter-out the unwanted blobs during the clone. (The current version only filters blobs; you still get full commits and trees. That will be revisited later.) On the client side, the partial clone installs local config settings into the repo so that subsequent fetches default to the same filter criteria as used in the clone. I don't currently have provision to send a full sparse-checkout specification to the server during a clone or fetch. That seemed like too much to try to squeeze into the protocols. We can revisit this later if there is interest, but it wasn't critical for the initial phase. 2. Add a file and push changes. Preconditions: all steps above followed. touch src/bar/baz.txt && git add -A && git commit -m "added a file" git push origin master Expected results: changes are pushed to remote. I don't believe partial clone and/or partial fetch will cause any changes for push. 3. Clone a repo with a sparse list as a filter. Preconditions: same as for #1 Actions: echo "src/bar" > /tmp/blah-sparse-checkout git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be the only command that would requires specific option key being passed. Expected results: same as for #1 plus /tmp/blah-sparse-checkout is copied into .git/info/sparse-checkout There are 2 independent concepts here: clone and checkout. Currently, there isn't any automatic linkage of the partial clone to the sparse-checkout settings, so you could do something like this: git clone --no-checkout --filter=sparse:oid=master:templates/bar URL git cat-file ... templates/bar >.git/info/sparse-checkout git config core.sparsecheckout true git checkout ... I've been focused on the clone/fetch issues and have not looked into the automation to couple them. 4. Showing log for sparsely cloned repo. Preconditions: #3 is followed Actions: git log Expected results: recent changes that affect src/bar tree. If I understand your meaning, log would only show changes within the sparse subset of the tree. This is not on my radar for partial clone/fetch. It would be a nice feature to have, but I think it would be better to think about it from the point of view of sparse-checkout rather than clone. 5. Showing diff. 
Preconditions: #3 is followed Actions: git diff HEAD^ HEAD Expected results: changes from the most recent commit affecting src/bar folder are shown. Notes: this can be tricky operation as filtering must be done to remove results from unrelated subtrees. I don't have any plan for this and I don't think it fits within the scope of clone/fetch. I think this too would be a sparse-checkout feature. *Note that I intentionally didn't mention use cases that are related to filtering by blob size as I think we should logically consider them as a separate, although related, feature. I've grouped blob-size and sparse filter together for the purposes of clone/fetch since the basic mechanisms (filtering, transport, and missing object handling) are the same for both. They do lead to different end-uses, but that is above my level here. What do you think about these examples above? Is that something that more-or-less fits into current development? Are there other important flows that I've missed? These are all good ideas and it is good to have someone else who wants to use partial+sparse thinking about it and looking for gaps as we try to make a complete end-to-end feature. -Vitaly Thanks Jeff
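Filling in the elided arguments from the clone/checkout sequence shown earlier in this message, the manual coupling might look roughly like this; the cat-file and checkout arguments are my guesses rather than part of the patches.

    git clone --no-checkout --filter=sparse:oid=master:templates/bar <url> repo
    cd repo
    # reuse the same spec blob as the local sparse-checkout definition
    # (fetched on demand if the clone filtered it out)
    git cat-file blob master:templates/bar > .git/info/sparse-checkout
    git config core.sparseCheckout true
    git checkout master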
Re: How hard would it be to implement sparse fetching/pulling?
Makes sense, I think this perfectly aligns with our needs too. Let me dive deeper into those patches and previous discussions, that you've kindly shared above, so I better understand details. I'm very excited about what you guys already did, it's a big deal for the community! On Thu, Nov 30, 2017 at 6:51 PM, Jonathan Nieder wrote: > Hi Vitaly, > > Vitaly Arbuzov wrote: > >> I think it would be great if we high level agree on desired user >> experience, so let me put a few possible use cases here. > > I think one thing this thread is pointing to is a lack of overview > documentation about how the 'partial clone' series currently works. > The basic components are: > > 1. extending git protocol to (1) allow fetching only a subset of the > objects reachable from the commits being fetched and (2) later, > going back and fetching the objects that were left out. > > We've also discussed some other protocol changes, e.g. to allow > obtaining the sizes of un-fetched objects without fetching the > objects themselves > > 2. extending git's on-disk format to allow having some objects not be > present but only be "promised" to be obtainable from a remote > repository. When running a command that requires those objects, > the user can choose to have it either (a) error out ("airplane > mode") or (b) fetch the required objects. > > It is still possible to work fully locally in such a repo, make > changes, get useful results out of "git fsck", etc. It is kind of > similar to the existing "shallow clone" feature, except that there > is a more straightforward way to obtain objects that are outside > the "shallow" clone when needed on demand. > > 3. improving everyday commands to require fewer objects. For > example, if I run "git log -p", then I way to see the history of > most files but I don't necessarily want to download large binary > files just to print 'Binary files differ' for them. > > And by the same token, we might want to have a mode for commands > like "git log -p" to default to restricting to a particular > directory, instead of downloading files outside that directory. > > There are some fundamental changes to make in this category --- > e.g. modifying the index format to not require entries for files > outside the sparse checkout, to avoid having to download the > trees for them. > > The overall goal is to make git scale better. > > The existing patches do (1) and (2), though it is possible to do more > in those categories. :) We have plans to work on (3) as well. > > These are overall changes that happen at a fairly low level in git. > They mostly don't require changes command-by-command. > > Thanks, > Jonathan
Re: How hard would it be to implement sparse fetching/pulling?
Hi Vitaly,

Vitaly Arbuzov wrote:
> I think it would be great if we high level agree on desired user
> experience, so let me put a few possible use cases here.

I think one thing this thread is pointing to is a lack of overview documentation about how the 'partial clone' series currently works. The basic components are:

1. extending git protocol to (1) allow fetching only a subset of the objects reachable from the commits being fetched and (2) later, going back and fetching the objects that were left out.

   We've also discussed some other protocol changes, e.g. to allow obtaining the sizes of un-fetched objects without fetching the objects themselves.

2. extending git's on-disk format to allow having some objects not be present but only be "promised" to be obtainable from a remote repository. When running a command that requires those objects, the user can choose to have it either (a) error out ("airplane mode") or (b) fetch the required objects.

   It is still possible to work fully locally in such a repo, make changes, get useful results out of "git fsck", etc. It is kind of similar to the existing "shallow clone" feature, except that there is a more straightforward way to obtain objects that are outside the "shallow" clone when needed on demand.

3. improving everyday commands to require fewer objects. For example, if I run "git log -p", then I want to see the history of most files but I don't necessarily want to download large binary files just to print 'Binary files differ' for them.

   And by the same token, we might want to have a mode for commands like "git log -p" to default to restricting to a particular directory, instead of downloading files outside that directory.

   There are some fundamental changes to make in this category --- e.g. modifying the index format to not require entries for files outside the sparse checkout, to avoid having to download the trees for them.

The overall goal is to make git scale better.

The existing patches do (1) and (2), though it is possible to do more in those categories. :) We have plans to work on (3) as well.

These are overall changes that happen at a fairly low level in git. They mostly don't require changes command-by-command.

Thanks,
Jonathan
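To make the "promised" objects in component (2) more tangible: releases containing this work let you list which reachable objects are locally absent and choose between failing and tolerating the gaps. A small sketch (the flag names are from those releases, not necessarily from the patches under discussion here):

    # print '?<oid>' for each object that is promised but not present locally
    git rev-list --objects --missing=print HEAD
    # "airplane mode" behaviour: error out instead of tolerating the gaps
    git rev-list --objects --missing=error HEAD

git fsck in such a repository likewise treats promised objects as expected rather than as corruption.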
Re: How hard would it be to implement sparse fetching/pulling?
I think it would be great if we agree at a high level on the desired user experience, so let me put a few possible use cases here.

1. Init and fetch into a new repo with a sparse list.
Preconditions: origin blah exists and has a lot of folders inside of src, including "bar".
Actions:
git init foo && cd foo
git config core.sparseAll true # New flag to activate all sparse operations by default so you don't need to pass options to each command.
echo "src/bar" > .git/info/sparse-checkout
git remote add origin blah
git pull origin master
Expected results: foo contains the src/bar folder and nothing else; objects that are unrelated to this tree are not fetched.
Notes: This should work the same way when fetch/merge/checkout operations are used in the right order.

2. Add a file and push changes.
Preconditions: all steps above followed.
Actions:
touch src/bar/baz.txt && git add -A && git commit -m "added a file"
git push origin master
Expected results: changes are pushed to remote.

3. Clone a repo with a sparse list as a filter.
Preconditions: same as for #1
Actions:
echo "src/bar" > /tmp/blah-sparse-checkout
git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be the only command that requires a specific option to be passed.
Expected results: same as for #1, plus /tmp/blah-sparse-checkout is copied into .git/info/sparse-checkout

4. Showing log for a sparsely cloned repo.
Preconditions: #3 is followed
Actions: git log
Expected results: recent changes that affect the src/bar tree.

5. Showing diff.
Preconditions: #3 is followed
Actions: git diff HEAD^ HEAD
Expected results: changes from the most recent commit affecting the src/bar folder are shown.
Notes: this can be a tricky operation, as filtering must be done to remove results from unrelated subtrees.

*Note that I intentionally didn't mention use cases that are related to filtering by blob size, as I think we should logically consider them as a separate, although related, feature.

What do you think about these examples above? Is that something that more-or-less fits into current development? Are there other important flows that I've missed?

-Vitaly

On Thu, Nov 30, 2017 at 5:27 PM, Vitaly Arbuzov wrote:
> Jonathan, thanks for references, that is super helpful, I will follow
> your suggestions.
>
> Philip, I agree that keeping original DVCS off-line capability is an
> important point. Ideally this feature should work even with remotes
> that are located on the local disk.
> Which part of Jeff's work do you think wouldn't work offline after
> repo initialization is done and sparse fetch is performed? All the
> stuff that I've seen seems to be quite usable without GVFS.
> I'm not sure if we need to store markers/tombstones on the client,
> what problem does it solve?
>
> On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley wrote:
>> From: "Vitaly Arbuzov"
>>>
>>> Found some details here: https://github.com/jeffhostetler/git/pull/3
>>>
>>> Looking at commits I see that you've done a lot of work already,
>>> including packing, filtering, fetching, cloning etc.
>>> What are some areas that aren't complete yet? Do you need any help
>>> with implementation?
>>>
>>
>> comments below..
>>
>>>
>>> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov wrote: Hey Jeff, It's great, I didn't expect that anyone is actively working on this. I'll check out your branch, meanwhile do you have any design docs that describe these changes or can you define high level goals that you want to achieve?
On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler wrote: > > > > On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote: >> >> >> Hi guys, >> >> I'm looking for ways to improve fetch/pull/clone time for large git >> (mono)repositories with unrelated source trees (that span across >> multiple services). >> I've found sparse checkout approach appealing and helpful for most of >> client-side operations (e.g. status, reset, commit, etc.) >> The problem is that there is no feature like sparse fetch/pull in git, >> this means that ALL objects in unrelated trees are always fetched. >> It may take a lot of time for large repositories and results in some >> practical scalability limits for git. >> This forced some large companies like Facebook and Google to move to >> Mercurial as they were unable to improve client-side experience with >> git while Microsoft has developed GVFS, which seems to be a step back >> to CVCS world. >> >> I want to get a feedback (from more experienced git users than I am) >> on what it would take to implement sparse fetching/pulling. >> (Downloading only objects related to the sparse-checkout list) >> Are there any issues with missing hashes? >> Are there any fundamental problems why it can't be done? >> Can we get away with only client-side changes or would it require >> special features on the server side? >> >> >> I have, for separate reason
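Use case 3 above maps fairly closely onto the sparse filter mentioned earlier in the thread (passing a blob-ish naming a sparse-checkout specification to the server). A sketch, assuming the filter spelling that eventually landed upstream (--filter=sparse:oid=<blob-ish>) and a server willing to resolve and apply it; the spec name is illustrative, not taken from the thread:

    # the server applies the sparse spec named by <blob-ish> and omits unrelated blobs
    git clone --no-checkout --filter=sparse:oid=<blob-ish> blah
    cd blah
    git config core.sparseCheckout true
    echo "src/bar" > .git/info/sparse-checkout
    git checkout master

Whether clone stays the only command that needs an explicit option, as the use case asks, then depends on the recorded filter being reused by subsequent fetches rather than re-specified each time.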
Re: How hard would it be to implement sparse fetching/pulling?
Jonathan, thanks for references, that is super helpful, I will follow your suggestions. Philip, I agree that keeping original DVCS off-line capability is an important point. Ideally this feature should work even with remotes that are located on the local disk. Which part of Jeff's work do you think wouldn't work offline after repo initialization is done and sparse fetch is performed? All the stuff that I've seen seems to be quite usable without GVFS. I'm not sure if we need to store markers/tombstones on the client, what problem does it solve? On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley wrote: > From: "Vitaly Arbuzov" >> >> Found some details here: https://github.com/jeffhostetler/git/pull/3 >> >> Looking at commits I see that you've done a lot of work already, >> including packing, filtering, fetching, cloning etc. >> What are some areas that aren't complete yet? Do you need any help >> with implementation? >> > > comments below.. > >> >> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov wrote: >>> >>> Hey Jeff, >>> >>> It's great, I didn't expect that anyone is actively working on this. >>> I'll check out your branch, meanwhile do you have any design docs that >>> describe these changes or can you define high level goals that you >>> want to achieve? >>> >>> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler >>> wrote: On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote: > > > Hi guys, > > I'm looking for ways to improve fetch/pull/clone time for large git > (mono)repositories with unrelated source trees (that span across > multiple services). > I've found sparse checkout approach appealing and helpful for most of > client-side operations (e.g. status, reset, commit, etc.) > The problem is that there is no feature like sparse fetch/pull in git, > this means that ALL objects in unrelated trees are always fetched. > It may take a lot of time for large repositories and results in some > practical scalability limits for git. > This forced some large companies like Facebook and Google to move to > Mercurial as they were unable to improve client-side experience with > git while Microsoft has developed GVFS, which seems to be a step back > to CVCS world. > > I want to get a feedback (from more experienced git users than I am) > on what it would take to implement sparse fetching/pulling. > (Downloading only objects related to the sparse-checkout list) > Are there any issues with missing hashes? > Are there any fundamental problems why it can't be done? > Can we get away with only client-side changes or would it require > special features on the server side? > > > I have, for separate reasons been _thinking_ about the issue ($dayjob is in > defence, so a similar partition would be useful). > > The changes would almost certainly need to be server side (as well as client > side), as it is the server that decides what is sent over the wire in the > pack files, which would need to be a 'narrow' pack file. > > If we had such a feature then all we would need on top is a separate > tool that builds the right "sparse" scope for the workspace based on > paths that developer wants to work on. > > In the world where more and more companies are moving towards large > monorepos this improvement would provide a good way of scaling git to > meet this demand. > > > The 'companies' problem is that it tends to force a client-server, always-on > on-line mentality. 
I'm also wanting the original DVCS off-line capability to > still be available, with _user_ control, in a generic sense, of what they > have locally available (including files/directories they have not yet looked > at, but expect to have. IIUC Jeff's work is that on-line view, without the > off-line capability. > > I'd commented early in the series at [1,2,3]. > > > At its core, my idea was to use the object store to hold markers for the > 'not yet fetched' objects (mainly trees and blobs). These would be in a > known fixed format, and have the same effect (conceptually) as the > sub-module markers - they _confirm_ the oid, yet say 'not here, try > elsewhere'. > > The comaprison with submodules mean there is the same chance of > de-synchronisation with triangular and upstream servers, unless managed. > > The server side, as noted, will need to be included as it is the one that > decides the pack file. > > Options for a server management are: > > - "I accept narrow packs?" No; yes > > - "I serve narrow packs?" No; yes. > > - "Repo completeness checks on reciept": (must be complete) || (allow narrow > to nothing). > > For server farms (e.g. Github..) the settings could be global, or by repo. > (note that the completeness requirement and narrow reciept option are not > incompatible - the recipient server can reject the pack from a narrow > subordinate as incomplete - see below) > > * Marking of 'missing' objects in the local obj
Re: How hard would it be to implement sparse fetching/pulling?
From: "Vitaly Arbuzov" Found some details here: https://github.com/jeffhostetler/git/pull/3 Looking at commits I see that you've done a lot of work already, including packing, filtering, fetching, cloning etc. What are some areas that aren't complete yet? Do you need any help with implementation? comments below.. On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov wrote: Hey Jeff, It's great, I didn't expect that anyone is actively working on this. I'll check out your branch, meanwhile do you have any design docs that describe these changes or can you define high level goals that you want to achieve? On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler wrote: On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote: Hi guys, I'm looking for ways to improve fetch/pull/clone time for large git (mono)repositories with unrelated source trees (that span across multiple services). I've found sparse checkout approach appealing and helpful for most of client-side operations (e.g. status, reset, commit, etc.) The problem is that there is no feature like sparse fetch/pull in git, this means that ALL objects in unrelated trees are always fetched. It may take a lot of time for large repositories and results in some practical scalability limits for git. This forced some large companies like Facebook and Google to move to Mercurial as they were unable to improve client-side experience with git while Microsoft has developed GVFS, which seems to be a step back to CVCS world. I want to get a feedback (from more experienced git users than I am) on what it would take to implement sparse fetching/pulling. (Downloading only objects related to the sparse-checkout list) Are there any issues with missing hashes? Are there any fundamental problems why it can't be done? Can we get away with only client-side changes or would it require special features on the server side? I have, for separate reasons been _thinking_ about the issue ($dayjob is in defence, so a similar partition would be useful). The changes would almost certainly need to be server side (as well as client side), as it is the server that decides what is sent over the wire in the pack files, which would need to be a 'narrow' pack file. If we had such a feature then all we would need on top is a separate tool that builds the right "sparse" scope for the workspace based on paths that developer wants to work on. In the world where more and more companies are moving towards large monorepos this improvement would provide a good way of scaling git to meet this demand. The 'companies' problem is that it tends to force a client-server, always-on on-line mentality. I'm also wanting the original DVCS off-line capability to still be available, with _user_ control, in a generic sense, of what they have locally available (including files/directories they have not yet looked at, but expect to have. IIUC Jeff's work is that on-line view, without the off-line capability. I'd commented early in the series at [1,2,3]. At its core, my idea was to use the object store to hold markers for the 'not yet fetched' objects (mainly trees and blobs). These would be in a known fixed format, and have the same effect (conceptually) as the sub-module markers - they _confirm_ the oid, yet say 'not here, try elsewhere'. The comaprison with submodules mean there is the same chance of de-synchronisation with triangular and upstream servers, unless managed. The server side, as noted, will need to be included as it is the one that decides the pack file. 
Options for server management are:

- "I accept narrow packs?" No; yes.
- "I serve narrow packs?" No; yes.
- "Repo completeness checks on receipt": (must be complete) || (allow narrow to nothing).

For server farms (e.g. Github..) the settings could be global, or by repo. (Note that the completeness requirement and narrow receipt option are not incompatible - the recipient server can reject the pack from a narrow subordinate as incomplete - see below.)

* Marking of 'missing' objects in the local object store, and on the wire. The missing objects are replaced by a placeholder object, which uses the same oid/sha1, but has a short fixed length, with content “GitNarrowObject ”. The chance that that string would actually have such an oid clash is the same as for all other object hashes, so it is a *safe* self-referential device.

* The stored object already includes length (and inferred type), so we do know what it stands in for. Thus the local index (index file) should be able to be recreated from the object store alone (including the ‘promised / narrow / missing’ files/directory markers).

* The ‘same’ as sub-modules. The potential for loss of synchronisation with a golden complete repo is just the same as for sub-modules. (We expected object/commit X here, but it’s not in the store.) This could happen with a small user group who have locally narrow clones, who interact with their local narrow server for ‘backup’, and then fail to push f
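Purely to illustrate the policy matrix above, the three questions could be imagined as server-side settings. None of these configuration keys exist in git; they are hypothetical names for the options Philip lists:

    # hypothetical keys, shown only to make the option matrix concrete
    git config narrow.acceptNarrowPacks true    # "I accept narrow packs?"
    git config narrow.serveNarrowPacks true     # "I serve narrow packs?"
    git config narrow.requireComplete true      # reject a narrow push as incomplete

With the last one set, the server-farm behaviour described above follows: a complete upstream can still refuse a narrow pack from a narrow subordinate.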
Re: How hard would it be to implement sparse fetching/pulling?
Hi Vitaly,

Vitaly Arbuzov wrote:
> Found some details here: https://github.com/jeffhostetler/git/pull/3
>
> Looking at commits I see that you've done a lot of work already,
> including packing, filtering, fetching, cloning etc.
> What are some areas that aren't complete yet? Do you need any help
> with implementation?

That's a great question! I've filed https://crbug.com/git/2 to track this project. Feel free to star it to get updates there, or to add updates of your own.

As described at https://crbug.com/git/2#c1, currently there are three patch series for which review would be very welcome. Building on top of them is welcome as well. Please make sure to coordinate with jeffh...@microsoft.com and jonathanta...@google.com (e.g. through the bug tracker or email).

One piece of missing functionality that looks interesting to me: that series batches fetches of the missing blobs involved in a "git checkout" command:

https://public-inbox.org/git/20171121211528.21891-14-...@jeffhostetler.com/

But it doesn't batch fetches of the missing blobs involved in a "git diff <commit> <commit>" command. That might be a good place to get your hands dirty. :)

Thanks,
Jonathan
Re: How hard would it be to implement sparse fetching/pulling?
Found some details here: https://github.com/jeffhostetler/git/pull/3 Looking at commits I see that you've done a lot of work already, including packing, filtering, fetching, cloning etc. What are some areas that aren't complete yet? Do you need any help with implementation? On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov wrote: > Hey Jeff, > > It's great, I didn't expect that anyone is actively working on this. > I'll check out your branch, meanwhile do you have any design docs that > describe these changes or can you define high level goals that you > want to achieve? > > On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler > wrote: >> >> >> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote: >>> >>> Hi guys, >>> >>> I'm looking for ways to improve fetch/pull/clone time for large git >>> (mono)repositories with unrelated source trees (that span across >>> multiple services). >>> I've found sparse checkout approach appealing and helpful for most of >>> client-side operations (e.g. status, reset, commit, etc.) >>> The problem is that there is no feature like sparse fetch/pull in git, >>> this means that ALL objects in unrelated trees are always fetched. >>> It may take a lot of time for large repositories and results in some >>> practical scalability limits for git. >>> This forced some large companies like Facebook and Google to move to >>> Mercurial as they were unable to improve client-side experience with >>> git while Microsoft has developed GVFS, which seems to be a step back >>> to CVCS world. >>> >>> I want to get a feedback (from more experienced git users than I am) >>> on what it would take to implement sparse fetching/pulling. >>> (Downloading only objects related to the sparse-checkout list) >>> Are there any issues with missing hashes? >>> Are there any fundamental problems why it can't be done? >>> Can we get away with only client-side changes or would it require >>> special features on the server side? >>> >>> If we had such a feature then all we would need on top is a separate >>> tool that builds the right "sparse" scope for the workspace based on >>> paths that developer wants to work on. >>> >>> In the world where more and more companies are moving towards large >>> monorepos this improvement would provide a good way of scaling git to >>> meet this demand. >>> >>> PS. Please don't advice to split things up, as there are some good >>> reasons why many companies decide to keep their code in the monorepo, >>> which you can easily find online. So let's keep that part out the >>> scope. >>> >>> -Vitaly >>> >> >> >> This work is in-progress now. A short summary can be found in [1] >> of the current parts 1, 2, and 3. >> >>> * jh/object-filtering (2017-11-22) 6 commits >>> * jh/fsck-promisors (2017-11-22) 10 commits >>> * jh/partial-clone (2017-11-22) 14 commits >> >> >> [1] >> https://public-inbox.org/git/xmqq1skh6fyz@gitster.mtv.corp.google.com/T/ >> >> I have a branch that contains V5 all 3 parts: >> https://github.com/jeffhostetler/git/tree/core/pc5_p3 >> >> This is a WIP, so there are some rough edges >> I hope to have a V6 out before the weekend with some >> bug fixes and cleanup. >> >> Please give it a try and see if it fits your needs. >> Currently, there are filter methods to filter all blobs, >> all large blobs, and one to match a sparse-checkout >> specification. >> >> Let me know if you have any questions or problems. >> >> Thanks, >> Jeff
Re: How hard would it be to implement sparse fetching/pulling?
Hey Jeff, It's great, I didn't expect that anyone is actively working on this. I'll check out your branch, meanwhile do you have any design docs that describe these changes or can you define high level goals that you want to achieve? On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler wrote: > > > On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote: >> >> Hi guys, >> >> I'm looking for ways to improve fetch/pull/clone time for large git >> (mono)repositories with unrelated source trees (that span across >> multiple services). >> I've found sparse checkout approach appealing and helpful for most of >> client-side operations (e.g. status, reset, commit, etc.) >> The problem is that there is no feature like sparse fetch/pull in git, >> this means that ALL objects in unrelated trees are always fetched. >> It may take a lot of time for large repositories and results in some >> practical scalability limits for git. >> This forced some large companies like Facebook and Google to move to >> Mercurial as they were unable to improve client-side experience with >> git while Microsoft has developed GVFS, which seems to be a step back >> to CVCS world. >> >> I want to get a feedback (from more experienced git users than I am) >> on what it would take to implement sparse fetching/pulling. >> (Downloading only objects related to the sparse-checkout list) >> Are there any issues with missing hashes? >> Are there any fundamental problems why it can't be done? >> Can we get away with only client-side changes or would it require >> special features on the server side? >> >> If we had such a feature then all we would need on top is a separate >> tool that builds the right "sparse" scope for the workspace based on >> paths that developer wants to work on. >> >> In the world where more and more companies are moving towards large >> monorepos this improvement would provide a good way of scaling git to >> meet this demand. >> >> PS. Please don't advice to split things up, as there are some good >> reasons why many companies decide to keep their code in the monorepo, >> which you can easily find online. So let's keep that part out the >> scope. >> >> -Vitaly >> > > > This work is in-progress now. A short summary can be found in [1] > of the current parts 1, 2, and 3. > >> * jh/object-filtering (2017-11-22) 6 commits >> * jh/fsck-promisors (2017-11-22) 10 commits >> * jh/partial-clone (2017-11-22) 14 commits > > > [1] > https://public-inbox.org/git/xmqq1skh6fyz@gitster.mtv.corp.google.com/T/ > > I have a branch that contains V5 all 3 parts: > https://github.com/jeffhostetler/git/tree/core/pc5_p3 > > This is a WIP, so there are some rough edges > I hope to have a V6 out before the weekend with some > bug fixes and cleanup. > > Please give it a try and see if it fits your needs. > Currently, there are filter methods to filter all blobs, > all large blobs, and one to match a sparse-checkout > specification. > > Let me know if you have any questions or problems. > > Thanks, > Jeff
Re: How hard would it be to implement sparse fetching/pulling?
On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:

Hi guys,

I'm looking for ways to improve fetch/pull/clone time for large git (mono)repositories with unrelated source trees (that span across multiple services). I've found sparse checkout approach appealing and helpful for most of client-side operations (e.g. status, reset, commit, etc.) The problem is that there is no feature like sparse fetch/pull in git, this means that ALL objects in unrelated trees are always fetched. It may take a lot of time for large repositories and results in some practical scalability limits for git. This forced some large companies like Facebook and Google to move to Mercurial as they were unable to improve client-side experience with git while Microsoft has developed GVFS, which seems to be a step back to CVCS world.

I want to get feedback (from more experienced git users than I am) on what it would take to implement sparse fetching/pulling. (Downloading only objects related to the sparse-checkout list.) Are there any issues with missing hashes? Are there any fundamental problems why it can't be done? Can we get away with only client-side changes or would it require special features on the server side?

If we had such a feature then all we would need on top is a separate tool that builds the right "sparse" scope for the workspace based on paths that developer wants to work on.

In the world where more and more companies are moving towards large monorepos this improvement would provide a good way of scaling git to meet this demand.

PS. Please don't advise splitting things up, as there are some good reasons why many companies decide to keep their code in the monorepo, which you can easily find online. So let's keep that part out of scope.

-Vitaly

This work is in-progress now. A short summary can be found in [1] of the current parts 1, 2, and 3.

* jh/object-filtering (2017-11-22) 6 commits
* jh/fsck-promisors (2017-11-22) 10 commits
* jh/partial-clone (2017-11-22) 14 commits

[1] https://public-inbox.org/git/xmqq1skh6fyz@gitster.mtv.corp.google.com/T/

I have a branch that contains V5 of all 3 parts:
https://github.com/jeffhostetler/git/tree/core/pc5_p3

This is a WIP, so there are some rough edges. I hope to have a V6 out before the weekend with some bug fixes and cleanup.

Please give it a try and see if it fits your needs. Currently, there are filter methods to filter all blobs, all large blobs, and one to match a sparse-checkout specification.

Let me know if you have any questions or problems.

Thanks,
Jeff
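For reference, the three filter methods Jeff describes correspond to filter specifications passed at clone/fetch time. The spellings below are the ones that eventually landed upstream, so they are an assumption about the WIP branch rather than quotes from it; <url> and <blob-ish> are placeholders:

    git clone --filter=blob:none <url>              # omit all blobs
    git clone --filter=blob:limit=1m <url>          # omit blobs larger than 1 MiB
    git clone --filter=sparse:oid=<blob-ish> <url>  # omit blobs outside a sparse-checkout spec

In releases containing this work the same --filter argument is also accepted by fetch and by plumbing such as rev-list and pack-objects.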