Re: How hard would it be to implement sparse fetching/pulling?

2017-12-05 Thread Philip Oakley

From: "Jeff Hostetler" 
Sent: Monday, December 04, 2017 3:36 PM


On 12/2/2017 11:30 AM, Philip Oakley wrote:

From: "Jeff Hostetler" 
Sent: Friday, December 01, 2017 2:30 PM

On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:

I think it would be great if we agree at a high level on the desired user
experience, so let me put a few possible use cases here.

1. Init and fetch into a new repo with a sparse list.
Preconditions: origin blah exists and has a lot of folders inside of
src including "bar".
Actions:
git init foo && cd foo
git config core.sparseAll true # New flag to activate all sparse
operations by default so you don't need to pass options to each
command.
echo "src/bar" > .git/info/sparse-checkout
git remote add origin blah
git pull origin master
Expected results: foo contains src/bar folder and nothing else,
objects that are unrelated to this tree are not fetched.
Notes: This should work the same when fetch/merge/checkout operations are
used in the right order.


With the current patches (parts 1,2,3) we can pass a blob-ish
to the server during a clone that refers to a sparse-checkout
specification.


I hadn't appreciated this capability. I see it as important, and it should 
be available both ways, so that a .gitNarrow spec can be imposed from the 
server side, as well as by the requester.


It could also be used to assist in the 'precious/secret' blob problem, so 
that AWS keys are never pushed, nor available for fetching!


To be honest, I've always considered partial clone/fetch as
a client-side request as a performance feature to minimize
download times and disk space requirements on the client.


Mine was a two-way view where one side or the other specified an extent for the 
narrow clone, to achieve either the speed/space improvement or the partitioning 
capability.



I've not thought of it from the "server has secrets" point
of view.


My potential for "secrets" was a little softer than some of the 'hard' 
security that is often discussed. I'm for the layered risk approach (swiss 
cheese model).


We can talk about it, but I'd like to keep it outside the
scope of the current effort.


Agreed.


My concern is that this is not the appropriate mechanism to enforce
MAC/DAC-like security policies.  For example:
[a] The client will still receive the containing trees that
refer to the sensitive blobs, so the user can tell when
the secret blobs change -- they wouldn't have either blob,
but can tell when they are changed.  This event by itself
may or may not leak sensitive information depending on the
terms of the security policy in place.
[b] The existence of such missing blobs would tell the client
which blobs are significant and secret and allow them to
focus their attack.  It would be better if those assets
were completely hidden and not in the tree at all.
[c] The client could push a fake secret blob to replace the
valid one on the server.  You would have to audit the
server to ensure that it never accepts a push containing
a change to any secret blob.  And the server would need
an infrastructure to know about all secrets in the tree.
[d] When a secret blob does change, any local merges by the
user lack information to complete the merge -- they can't
merge the secrets and they can't be trusted to correctly
pick-ours or pick-theirs -- so their workflows are broken.
I'm not trying to blindly spread FUD here, but it is arguments
like these that make me suggest that the partial clone mechanism
is not the right vehicle for such "secret" blobs.


I'm on the 'a little security is better than no security' side, but all the 
points are valid.






There's a bit of a chicken-n-egg problem getting
things set up. So if we assume your team would create a series
of "known enlistments" under version control, then you could


s/enlistments/entitlements/ I presume?


Within my org we speak of "enlistments" as subset of the tree
that you plan to work on.  For example, you might enlist in the
"file system" portion of the tree or in the "device drivers"
portion.  If the Makefiles have good partitioning, you should
only need one of the above portions to do productive work within
a feature area.
Ah, so it's the things that have been requested by the client ("I'd like to 
enlist in the ...").




I'm not sure what you mean by "entitlements".


It is like having the title deeds to a house - a list of things you have, or 
can have. (e.g. a father saying: you can have the car on Saturday 6pm-11pm)


At the end of the day the particular lists would be the same, they guide 
what is sent.







just reference one by <ref>:<path> during your clone. The
server can look up that blob and just use it.

git clone --filter=sparse:oid=master:templates/bar URL

And then the server will filter-out the unwanted blobs during
the clone. (The current version only filters blobs; you still
get full commits and trees. That will be revisited later.)


I'm for the idea that only the in-hierarchy trees should be sent.
It should also be possible that the server replies that it is
only sending a narrow clone, with the given (accessible?) spec.

Re: How hard would it be to implement sparse fetching/pulling?

2017-12-05 Thread Jonathan Nieder
Hi,

Jeff Hostetler wrote:
> On 12/2/2017 1:24 PM, Philip Oakley wrote:
>> From: "Jeff Hostetler" 
>> Sent: Friday, December 01, 2017 5:23 PM

>>> Discussing this feature in the context of the defense industry
>>> makes me a little nervous.  (I used to be in that area.)
>>
>> I'm viewing the desire for codebase partitioning from a soft layering
>> of risk view (perhaps a more UK than USA approach ;-)
>
> I'm not sure I know what this means or how the UK defense
> security models/policy/procedures are different from the US,
> so I can't say much here.  I'm just thinking that even if we
> get a *perfectly working* partial clone/fetch/push/etc. that
> it would not pass a security audit.  I might be wrong here
> (and I'm no expert on the subject), but I think they would
> push you towards a different solution architecture.

I'm pretty ignorant about the defense industry, but a few more
comments:

- gitolite implements some features on top of git's server code that I
  consider to be important for security.  So much so that I've been
  considering what it would take to remove the git-shell command from
  git.git and move it to the gitolite project where people would be
  better equipped to use it in an appropriate context

- in particular, git's reachability checking code could use some
  hardening/improvement.  Think of edge cases like where someone pushes
  a pack with deltas referring to objects they should not be able to
  reach.

- Anyone willing to audit git code's security wins my approval.
  Please, please, audit git code and report the issues you find. :)

[...]
> Also omitting certain trees means you now (obviously) have both missing
> trees and blobs.  And both need to be dynamically or batch fetched as
> needed.  And certain operations will need multiple round trips to fully
> resolve -- fault in a tree and then fault in blobs referenced by it.

For omitting trees, we will need to modify the index format, since the
index has entries for all paths today.  That's on the roadmap but has
not been implemented yet.

Thanks,
Jonathan


Re: How hard would it be to implement sparse fetching/pulling?

2017-12-05 Thread Jeff Hostetler



On 12/2/2017 1:24 PM, Philip Oakley wrote:

From: "Jeff Hostetler" 
Sent: Friday, December 01, 2017 5:23 PM

On 11/30/2017 6:43 PM, Philip Oakley wrote:

[...]


Discussing this feature in the context of the defense industry
makes me a little nervous.  (I used to be in that area.)


I'm viewing the desire for codebase partitioning from a soft layering
of risk view (perhaps a more UK than USA approach ;-)


I'm not sure I know what this means or how the UK defense
security models/policy/procedures are different from the US,
so I can't say much here.  I'm just thinking that even if we
get a *perfectly working* partial clone/fetch/push/etc. that
it would not pass a security audit.  I might be wrong here
(and I'm no expert on the subject), but I think they would
push you towards a different solution architecture.





What we have in the code so far may be a nice start, but
probably doesn't have the assurances that you would need
for actual deployment.  But it's a start


True. I need to get some of my colleagues more engaged...



[...]

Yes, this does tend to lead towards an always-online mentality.
However, there are 2 parts:
[a] dynamic object fetching for missing objects, such as during a
    random command like diff or blame or merge.  We need this
    regardless of usage -- because we can't always predict (or
    dry-run) every command the user might run in advance.


Making something "useful" happen here when off-line is an obvious goal.


[b] batch fetch mode, such as using partial-fetch to match your
    sparse-checkout so that you always have the blobs of interest
    to you.  And assuming you don't wander outside of this subset
    of the tree, you should be able to work offline as usual.
If you can work within the confines of [b], you wouldn't need to
always be online.


I feel this is the area that does need to ensure a capability to avoid
any perception of the much-maligned 'Embrace, extend, and extinguish' by 
accidental lockout.

I don't think this should be viewed as a type of sparse checkout -
it's just a checkout of what you have (under the hood it could use
the same code though).


Right, I'm only thinking of this effort as a way to get a partial
clone and fetch that omits unneeded (or, not immediately needed)
objects for performance reasons.  There are several use scenarios
that I've discussed and sparse-checkout is one of them, but I do
not consider this to be a sparse-checkout feature.

 
[...]


The main problem with markers or other lists of missing objects is
that it has scale problems for large repos.  Suppose I have 100M
blobs in my repo.  If I do a blob:none clone, I'd have 100M missing
blobs that would need tracking.  If I then do a batch fetch of the
blobs needed to do a sparse checkout of HEAD, I'd have to remove
those entries from the tracking data.  Not impossible, but not
speedy either.


** Ahhh. I see. That's a consequence of having all the trees, isn't it? **

I've always thought that limiting the trees is at the heart of the Narrow 
clone/fetch problem.

OK, so if you have flat, wide structures with 10k files/directories per tree 
then it's still a fair-sized problem, but it should *scale logarithmically* for 
the part of the tree structure that's not being downloaded.

You never have to add a marker for a blob that you have no containing tree for, 
nor for the tree that contained the blob's tree, all the way up the primary line 
of descent to the tree of concern. All those trees are never downloaded; there 
are just a few markers (.gitNarrowTree files) for those tree stubs, certainly no 100M 
missing blob markers.


Currently, the code only omits blobs.  I want to extend the current
code to have filters that also exclude unneeded trees.  That will help
address some of these size concerns, but there are still perf issues
here.



* Marking of 'missing' objects in the local object store, and on the wire.
The missing objects are replaced by a placeholder object, which uses the
same oid/sha1, but has a short fixed length, with content “GitNarrowObject
<oid>”. The chance that that string would actually have such an oid clash is
the same as all other object hashes, so is a *safe* self-referential device.


Again, there is a scale problem here.  If I have 100M missing blobs,
I can't afford to create 100M loose place holder files.  Or juggle
a 2GB file of missing objects on various operations.
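(For scale: 100M oids at 20 bytes each is already ~2GB in raw binary form,
and roughly double that as hex text with newlines, before any per-entry
bookkeeping.)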


As above, I'm also trimming the trees, so in general, there would be no missing 
 blobs, just the content of the directory one was interested in.

That's not quite true if higher level trees have blob references in them that 
are otherwise unwanted - they may each need a marker. [Or maybe a special 
single 'tree-of-blobs' marker for them all thus only one marker per tree - 
over-thinking maybe...]


Also omitting certain trees means you now (obviously) have both missing
trees and blobs.  And both need to be dynamically or batch fetched as
needed.  And certain operations will need multiple round trips to fully
resolve -- fault in a tree and then fault in blobs referenced by it.

Re: How hard would it be to implement sparse fetching/pulling?

2017-12-04 Thread Jeff Hostetler



On 12/1/2017 1:24 PM, Jonathan Nieder wrote:

Jeff Hostetler wrote:

On 11/30/2017 6:43 PM, Philip Oakley wrote:



The 'companies' problem is that it tends to force a client-server, always-on
on-line mentality. I'm also wanting the original DVCS off-line capability to
still be available, with _user_ control, in a generic sense, of what they
have locally available (including files/directories they have not yet looked
at, but expect to have). IIUC Jeff's work is that on-line view, without the
off-line capability.

I'd commented early in the series at [1,2,3].


Yes, this does tend to lead towards an always-online mentality.
However, there are 2 parts:
[a] dynamic object fetching for missing objects, such as during a
 random command like diff or blame or merge.  We need this
 regardless of usage -- because we can't always predict (or
 dry-run) every command the user might run in advance.
[b] batch fetch mode, such as using partial-fetch to match your
 sparse-checkout so that you always have the blobs of interest
 to you.  And assuming you don't wander outside of this subset
 of the tree, you should be able to work offline as usual.
If you can work within the confines of [b], you wouldn't need to
always be online.


Just to amplify this: for our internal use we care a lot about
disconnected usage working.  So it is not like we have forgotten about
this use case.


We might also add a part [c] with explicit commands to back-fill or
alter your incomplete view of the ODB


Agreed, this will be a nice thing to add.

[...]

At its core, my idea was to use the object store to hold markers for the
'not yet fetched' objects (mainly trees and blobs). These would be in a
known fixed format, and have the same effect (conceptually) as the
sub-module markers - they _confirm_ the oid, yet say 'not here, try
elsewhere'.


We do have something like this.  Jonathan can explain better than I, but
basically, we denote possibly incomplete packfiles from partial clones
and fetches as "promisor" and have special rules in the code to assert
that a missing blob referenced from a "promisor" packfile is OK and can
be fetched later if necessary from the "promising" remote.

The main problem with markers or other lists of missing objects is
that it has scale problems for large repos.


Any chance that we can get a design doc in Documentation/technical/
giving an overview of the design, with a brief "alternatives
considered" section describing this kind of thing?


Yeah, I'll start one.  I have notes within the individual protocol
docs and man-pages, but no summary doc.  Thanks!



E.g. some of the earlier descriptions like
  
https://public-inbox.org/git/20170915134343.3814d...@twelve2.svl.corp.google.com/
  https://public-inbox.org/git/cover.1506714999.git.jonathanta...@google.com/
  https://public-inbox.org/git/20170113155253.1644-1-benpe...@microsoft.com/
may help as a starting point.

Thanks,
Jonathan



Re: How hard would it be to implement sparse fetching/pulling?

2017-12-04 Thread Jeff Hostetler



On 12/2/2017 11:30 AM, Philip Oakley wrote:

From: "Jeff Hostetler" 
Sent: Friday, December 01, 2017 2:30 PM

On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:

I think it would be great if we agree at a high level on the desired user
experience, so let me put a few possible use cases here.

1. Init and fetch into a new repo with a sparse list.
Preconditions: origin blah exists and has a lot of folders inside of
src including "bar".
Actions:
git init foo && cd foo
git config core.sparseAll true # New flag to activate all sparse
operations by default so you don't need to pass options to each
command.
echo "src/bar" > .git/info/sparse-checkout
git remote add origin blah
git pull origin master
Expected results: foo contains src/bar folder and nothing else,
objects that are unrelated to this tree are not fetched.
Notes: This should work the same when fetch/merge/checkout operations are
used in the right order.


With the current patches (parts 1,2,3) we can pass a blob-ish
to the server during a clone that refers to a sparse-checkout
specification.


I hadn't appreciated this capability. I see it as important, and it should be 
available both ways, so that a .gitNarrow spec can be imposed from the server 
side, as well as by the requester.

It could also be used to assist in the 'precious/secret' blob problem, so that 
AWS keys are never pushed, nor available for fetching!


To be honest, I've always considered partial clone/fetch as
a client-side request as a performance feature to minimize
download times and disk space requirements on the client.
I've not thought of it from the "server has secrets" point
of view.

We can talk about it, but I'd like to keep it outside the
scope of the current effort.  My concern is that this is not the
appropriate mechanism to enforce MAC/DAC-like security policies.
For example:
[a] The client will still receive the containing trees that
refer to the sensitive blobs, so the user can tell when
the secret blobs change -- they wouldn't have either blob,
but can tell when they are changed.  This event by itself
may or may not leak sensitive information depending on the
terms of the security policy in place.
[b] The existence of such missing blobs would tell the client
which blobs are significant and secret and allow them to
focus their attack.  It would be better if those assets
were completely hidden and not in the tree at all.
[c] The client could push a fake secret blob to replace the
valid one on the server.  You would have to audit the
server to ensure that it never accepts a push containing
a change to any secret blob.  And the server would need
an infrastructure to know about all secrets in the tree.
[d] When a secret blob does change, any local merges by the
user lack information to complete the merge -- they can't
merge the secrets and they can't be trusted to correctly
pick-ours or pick-theirs -- so their workflows are broken.
I'm not trying to blindly spread FUD here, but it is arguments
like these that make me suggest that the partial clone mechanism
is not the right vehicle for such "secret" blobs.
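To make the audit burden in [c] concrete: the server would need something
like a pre-receive hook that scans every push for changes to the protected
paths.  A rough, purely illustrative sketch (the "secrets/" path is a
placeholder, and new-branch pushes with an all-zero old oid are not handled):

    #!/bin/sh
    # pre-receive: reject any push that touches a protected path
    while read old new ref
    do
        if git diff --name-only "$old" "$new" | grep -q '^secrets/'
        then
            echo "rejected: push modifies a protected path" >&2
            exit 1
        fi
    done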





   There's a bit of a chicken-n-egg problem getting
things set up.  So if we assume your team would create a series
of "known enlistments" under version control, then you could


s/enlistments/entitlements/ I presume?


Within my org we speak of "enlistments" as subset of the tree
that you plan to work on.  For example, you might enlist in the
"file system" portion of the tree or in the "device drivers"
portion.  If the Makefiles have good partitioning, you should
only need one of the above portions to do productive work within
a feature area.

I'm not sure what you mean by "entitlements".




just reference one by <ref>:<path> during your clone.  The
server can look up that blob and just use it.

    git clone --filter=sparse:oid=master:templates/bar URL

And then the server will filter-out the unwanted blobs during
the clone.  (The current version only filters blobs; you still
get full commits and trees.  That will be revisited later.)


I'm for the idea that only the in-hierarchy trees should be sent.
It should also be possible that the server replies that it is 
only sending a narrow clone, with the given (accessible?) spec.


I do want to extend this to have unneeded tree filtering too.
It is just not in this version.





On the client side, the partial clone installs local config
settings into the repo so that subsequent fetches default to
the same filter criteria as used in the clone.
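Something like the following is roughly what those settings could look
like -- the key names here are an illustrative assumption, not a quote
from the patches:

    git config remote.origin.promisor true
    git config remote.origin.partialclonefilter sparse:oid=master:templates/bar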


I don't currently have provision to send a full sparse-checkout
specification to the server during a clone or fetch.  That
seemed like too much to try to squeeze into the protocols.
We can revisit this later if there is interest, but it wasn't
critical for the initial phase.


Agreed. I think it should be somewhere 'visible' to the user, but could be 
set up by the server admin / repo maintainer if they don't have write access. 
But there could still be the catch-22 - maybe one starts with a 
<toptree> : <oid> pair to define an origin point (it's not as refined as a 
.gitNarrow spec file, but is definitive). The toptree option could even 
allow sub-tree clones.. maybe..

Re: How hard would it be to implement sparse fetching/pulling?

2017-12-02 Thread Philip Oakley

From: "Jeff Hostetler" 
Sent: Friday, December 01, 2017 5:23 PM

On 11/30/2017 6:43 PM, Philip Oakley wrote:

From: "Vitaly Arbuzov" 

[...]

comments below..


On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov  wrote:

Hey Jeff,

It's great, I didn't expect that anyone is actively working on this.
I'll check out your branch, meanwhile do you have any design docs that
describe these changes or can you define high level goals that you
want to achieve?

On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler 
wrote:



On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:

[...]




I have, for separate reasons, been _thinking_ about the issue ($dayjob is in
defence, so a similar partition would be useful).

The changes would almost certainly need to be server side (as well as client
side), as it is the server that decides what is sent over the wire in the
pack files, which would need to be a 'narrow' pack file.


Yes, there will need to be both client and server changes.
In the current 3 part patch series, the client sends a "filter_spec"
to the server as part of the fetch-pack/upload-pack protocol.
If the server chooses to honor it, upload-pack passes the filter_spec
to pack-objects to build an "incomplete" packfile omitting various
objects (currently blobs).  Proprietary servers will need similar
changes to support this feature.

Discussing this feature in the context of the defense industry
makes me a little nervous.  (I used to be in that area.)


I'm viewing the desire for codebase partitioning from a soft layering of 
risk view (perhaps a more UK than USA approach ;-)



What we have in the code so far may be a nice start, but
probably doesn't have the assurances that you would need
for actual deployment.  But it's a start


True. I need to get some of my colleagues more engaged...





If we had such a feature then all we would need on top is a separate
tool that builds the right "sparse" scope for the workspace based on
paths that developer wants to work on.

In the world where more and more companies are moving towards large
monorepos this improvement would provide a good way of scaling git to
meet this demand.


The 'companies' problem is that it tends to force a client-server, always-on
on-line mentality. I'm also wanting the original DVCS off-line capability to
still be available, with _user_ control, in a generic sense, of what they
have locally available (including files/directories they have not yet looked
at, but expect to have). IIUC Jeff's work is that on-line view, without the
off-line capability.

I'd commented early in the series at [1,2,3].


Yes, this does tend to lead towards an always-online mentality.
However, there are 2 parts:
[a] dynamic object fetching for missing objects, such as during a
random command like diff or blame or merge.  We need this
regardless of usage -- because we can't always predict (or
dry-run) every command the user might run in advance.


Making something "useful" happen here when off-line is an obvious goal.


[b] batch fetch mode, such as using partial-fetch to match your
sparse-checkout so that you always have the blobs of interest
to you.  And assuming you don't wander outside of this subset
of the tree, you should be able to work offline as usual.
If you can work within the confines of [b], you wouldn't need to
always be online.


I feel this is the area that does need to ensure a capability to avoid any 
perception of the much-maligned 'Embrace, extend, and extinguish' by 
accidental lockout.


I don't think this should be viewed as a type of sparse checkout - it's just 
a checkout of what you have (under the hood it could use the same code 
though).




We might also add a part [c] with explicit commands to back-fill or
alter your incomplete view of the ODB (as I explained in response
to the "git diff <commit> <commit>" comment later in this thread).



At its core, my idea was to use the object store to hold markers for the
'not yet fetched' objects (mainly trees and blobs). These would be in a 
known fixed format, and have the same effect (conceptually) as the 
sub-module markers - they _confirm_ the oid, yet say 'not here, try 
elsewhere'.


We do have something like this.  Jonathan can explain better than I, but
basically, we denote possibly incomplete packfiles from partial clones
and fetches as "promisor" and have special rules in the code to assert
that a missing blob referenced from a "promisor" packfile is OK and can
be fetched later if necessary from the "promising" remote.


The remote interaction is one area that may need thought, especially in a 
triangle workflow, of which there are a few.




The main problem with markers or other lists of missing objects is
that it has scale problems for large repos.  Suppose I have 100M
blobs in my repo.  If I do a blob:none clone, I'd have 100M missing
blobs that would need tracking.  If I then do a batch fetch of the
blobs needed to do a sparse checkout of HEAD, I'd have to remove
those entries from the tracking data.  Not impossible, but not
speedy either.

Re: How hard would it be to implement sparse fetching/pulling?

2017-12-02 Thread Philip Oakley

Hi Jonathan,

Thanks for the outline. It has helped clarify some points and let me see the very 
similar alignments.


The one thing I wasn't clear about is the "promised" objects/remote. Is that 
"promisor" remote a fixed entity, or could it be one of many remotes that 
could be a "provider"? (sort of like fetching sub-modules...)


Philip

From: "Jonathan Nieder" 
Sent: Friday, December 01, 2017 2:51 AM

Hi Vitaly,

Vitaly Arbuzov wrote:


I think it would be great if we agree at a high level on the desired user
experience, so let me put a few possible use cases here.


I think one thing this thread is pointing to is a lack of overview
documentation about how the 'partial clone' series currently works.
The basic components are:

1. extending git protocol to (1) allow fetching only a subset of the
   objects reachable from the commits being fetched and (2) later,
   going back and fetching the objects that were left out.

   We've also discussed some other protocol changes, e.g. to allow
   obtaining the sizes of un-fetched objects without fetching the
   objects themselves

2. extending git's on-disk format to allow having some objects not be
   present but only be "promised" to be obtainable from a remote
   repository.  When running a command that requires those objects,
   the user can choose to have it either (a) error out ("airplane
   mode") or (b) fetch the required objects.

   It is still possible to work fully locally in such a repo, make
   changes, get useful results out of "git fsck", etc.  It is kind of
   similar to the existing "shallow clone" feature, except that there
   is a more straightforward way to obtain objects that are outside
   the "shallow" clone when needed on demand.

3. improving everyday commands to require fewer objects.  For
   example, if I run "git log -p", then I want to see the history of
   most files but I don't necessarily want to download large binary
   files just to print 'Binary files differ' for them.

   And by the same token, we might want to have a mode for commands
   like "git log -p" to default to restricting to a particular
   directory, instead of downloading files outside that directory.

   There are some fundamental changes to make in this category ---
   e.g. modifying the index format to not require entries for files
   outside the sparse checkout, to avoid having to download the
   trees for them.
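
   (Restricting the *output* is already possible with a pathspec, e.g.
   "git log -p -- src/bar"; the new part is avoiding the download of
   the objects outside that directory in the first place.)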

The overall goal is to make git scale better.

The existing patches do (1) and (2), though it is possible to do more
in those categories. :)  We have plans to work on (3) as well.

These are overall changes that happen at a fairly low level in git.
They mostly don't require changes command-by-command.

Thanks,
Jonathan 




Re: How hard would it be to implement sparse fetching/pulling?

2017-12-02 Thread Philip Oakley

From: "Jeff Hostetler" 
Sent: Friday, December 01, 2017 2:30 PM

On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:

I think it would be great if we agree at a high level on the desired user
experience, so let me put a few possible use cases here.

1. Init and fetch into a new repo with a sparse list.
Preconditions: origin blah exists and has a lot of folders inside of
src including "bar".
Actions:
git init foo && cd foo
git config core.sparseAll true # New flag to activate all sparse
operations by default so you don't need to pass options to each
command.
echo "src/bar" > .git/info/sparse-checkout
git remote add origin blah
git pull origin master
Expected results: foo contains src/bar folder and nothing else,
objects that are unrelated to this tree are not fetched.
Notes: This should work the same when fetch/merge/checkout operations are
used in the right order.


With the current patches (parts 1,2,3) we can pass a blob-ish
to the server during a clone that refers to a sparse-checkout
specification.


I hadn't appreciated this capability. I see it as important, and it should be 
available both ways, so that a .gitNarrow spec can be imposed from the 
server side, as well as by the requester.


It could also be used to assist in the 'precious/secret' blob problem, so 
that AWS keys are never pushed, nor available for fetching!



   There's a bit of a chicken-n-egg problem getting
things set up.  So if we assume your team would create a series
of "known enlistments" under version control, then you could


s/enlistments/entitlements/ I presume?


just reference one by <ref>:<path> during your clone.  The
server can look up that blob and just use it.

git clone --filter=sparse:oid=master:templates/bar URL

And then the server will filter-out the unwanted blobs during
the clone.  (The current version only filters blobs; you still
get full commits and trees.  That will be revisited later.)


I'm for the idea that only the in-hierarchy trees should be sent.
It should also be possible that the server replies that it is only sending a 
narrow clone, with the given (accessible?) spec.




On the client side, the partial clone installs local config
settings into the repo so that subsequent fetches default to
the same filter criteria as used in the clone.


I don't currently have provision to send a full sparse-checkout
specification to the server during a clone or fetch.  That
seemed like too much to try to squeeze into the protocols.
We can revisit this later if there is interest, but it wasn't
critical for the initial phase.

Agreed. I think it should be somewhere 'visible' to the user, but could be 
set up by the server admin / repo maintainer if they don't have write access. 
But there could still be the catch-22 - maybe one starts with a 
<toptree> : <oid> pair to define an origin point (it's not as refined as a 
.gitNarrow spec file, but is definitive). The toptree option could even 
allow sub-tree clones.. maybe..






2. Add a file and push changes.
Preconditions: all steps above followed.
touch src/bar/baz.txt && git add -A && git commit -m "added a file"
git push origin master
Expected results: changes are pushed to remote.


I don't believe partial clone and/or partial fetch will cause
any changes for push.


I suspect that pushes could be rejected if the user 'pretends' to modify 
files or trees outside their area. It would require the user to spoof part 
of a tree they don't have, so an upstream / remote would immediately 
know it was a spoof, but locally the narrow clone doesn't have enough detail 
about the 'bad' oid. It would be right to reject such attempts!






3. Clone a repo with a sparse list as a filter.
Preconditions: same as for #1
Actions:
echo "src/bar" > /tmp/blah-sparse-checkout
git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
the only command that requires a specific option key being passed.
Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
copied into .git/info/sparse-checkout


I presume clone and fetch are treated equivalently here.



There are 2 independent concepts here: clone and checkout.
Currently, there isn't any automatic linkage of the partial clone to
the sparse-checkout settings, so you could do something like this:

I see an implicit link in that clearly one cannot check out (inflate/populate) a 
file/directory that one does not have in the object store. But that does not 
imply the reverse linkage. The regular sparse checkout should be available 
independently of the local clone being a narrow one.



git clone --no-checkout --filter=sparse:oid=master:templates/bar URL
git cat-file ... templates/bar >.git/info/sparse-checkout
git config core.sparsecheckout true
git checkout ...

I've been focused on the clone/fetch issues and have not looked
into the automation to couple them.



I foresee that large files and certain files need to be filterable for 
fetch-clone, and that might not be (backward) compatible with the 
sparse-checkout.






4. Showing log for sparsely cloned repo.

Re: How hard would it be to implement sparse fetching/pulling?

2017-12-02 Thread Philip Oakley

From: "Vitaly Arbuzov" 
Sent: Friday, December 01, 2017 1:27 AM

Jonathan, thanks for references, that is super helpful, I will follow
your suggestions.


Philip, I agree that keeping original DVCS off-line capability is an
important point. Ideally this feature should work even with remotes
that are located on the local disk.

And with any other remote. (even to the extent that the other remote 
may indicate it has no capability, sorry, go away..)
E.g. One ought to be able to have/create a Github narrow fork of only the 
git.git/Documentation repo, and interact with that. (how much nicer if it was 
git.git/Documentation/ManPages/ to ease the exclusion of RelNotes/, howto/ 
and technical/ )



Which part of Jeff's work do you think wouldn't work offline after
repo initialization is done and sparse fetch is performed? All the
stuff that I've seen seems to be quite usable without GVFS.

I think it's that initial download that may be different, and what is 
expected of it. In my case, one may never connect to that server again, yet 
still be able to work both off-line and with other remotes (push and pull as 
per capabilities). Below I note that I'd only fetch the needed trees, not 
all of them. Also one needs to fetch a complete (pre-defined) subset, rather 
than an on-demand subset.



I'm not sure if we need to store markers/tombstones on the client,
what problem does it solve?

The part that the markers hope to solve is the part that I hadn't said: 
that they should also show in the work tree so that users can see what is 
missing and where.


Importantly I would also trim the directory (tree) structure so only the 
direct hierarchy of those files the user sees is visible, though at each 
level they would see side directory names (which are embedded in the 
hierarchical tree objects). (IIUC Jeff H's scheme downloads *all* trees, not 
just a few.)
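
(That per-level information is visible with plain git today, e.g.

    git ls-tree HEAD src/

lists the mode, type, oid and name of each immediate entry under src/ --
exactly the data a trimmed tree view would surface. The directory name is
just the one from the earlier examples.)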


It would mean that users can create a complete fresh tree and commit that 
can be merged and picked onto the upstream tree from the _directory worktree 
alone_, because the oids of all the parts are listed in the worktree. The 
actual objects for the missing oids would be available in the appropriate 
upstream.


It also means the index can be deleted, and with only the local narrow pack 
files and the current worktree the index can be recreated at the current 
sparseness level. (I'm hoping I've understood the distribution of data 
between the index and narrow packs correctly here ;-)


--
Philip

On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley  wrote:

From: "Vitaly Arbuzov" 


Found some details here: https://github.com/jeffhostetler/git/pull/3

Looking at commits I see that you've done a lot of work already,
including packing, filtering, fetching, cloning etc.
What are some areas that aren't complete yet? Do you need any help
with implementation?



comments below..



On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov  wrote:


Hey Jeff,

It's great, I didn't expect that anyone is actively working on this.
I'll check out your branch, meanwhile do you have any design docs that
describe these changes or can you define high level goals that you
want to achieve?

On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler 
wrote:




On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:



Hi guys,

I'm looking for ways to improve fetch/pull/clone time for large git
(mono)repositories with unrelated source trees (that span across
multiple services).
I've found the sparse checkout approach appealing and helpful for most
client-side operations (e.g. status, reset, commit, etc.)
The problem is that there is no feature like sparse fetch/pull in git;
this means that ALL objects in unrelated trees are always fetched.
It may take a lot of time for large repositories and results in some
practical scalability limits for git.
This forced some large companies like Facebook and Google to move to
Mercurial, as they were unable to improve the client-side experience with
git, while Microsoft has developed GVFS, which seems to be a step back
to the CVCS world.

I want to get a feedback (from more experienced git users than I am)
on what it would take to implement sparse fetching/pulling.
(Downloading only objects related to the sparse-checkout list)
Are there any issues with missing hashes?
Are there any fundamental problems why it can't be done?
Can we get away with only client-side changes or would it require
special features on the server side?



I have, for separate reasons, been _thinking_ about the issue ($dayjob is in
defence, so a similar partition would be useful).

The changes would almost certainly need to be server side (as well as client
side), as it is the server that decides what is sent over the wire in the
pack files, which would need to be a 'narrow' pack file.


If we had such a feature then all we would need on top is a separate
tool that builds the right "sparse" scope for the workspace based on
paths that developer wants to work on.

In the world where more and more companies are moving towards large
monorepos this improvement would provide a good way of scaling git to
meet this demand.

Re: How hard would it be to implement sparse fetching/pulling?

2017-12-01 Thread Jonathan Nieder
Jeff Hostetler wrote:
> On 11/30/2017 6:43 PM, Philip Oakley wrote:

>> The 'companies' problem is that it tends to force a client-server, always-on
>> on-line mentality. I'm also wanting the original DVCS off-line capability to
>> still be available, with _user_ control, in a generic sense, of what they
>> have locally available (including files/directories they have not yet looked
>> at, but expect to have). IIUC Jeff's work is that on-line view, without the
>> off-line capability.
>>
>> I'd commented early in the series at [1,2,3].
>
> Yes, this does tend to lead towards an always-online mentality.
> However, there are 2 parts:
> [a] dynamic object fetching for missing objects, such as during a
> random command like diff or blame or merge.  We need this
> regardless of usage -- because we can't always predict (or
> dry-run) every command the user might run in advance.
> [b] batch fetch mode, such as using partial-fetch to match your
> sparse-checkout so that you always have the blobs of interest
> to you.  And assuming you don't wander outside of this subset
> of the tree, you should be able to work offline as usual.
> If you can work within the confines of [b], you wouldn't need to
> always be online.

Just to amplify this: for our internal use we care a lot about
disconnected usage working.  So it is not like we have forgotten about
this use case.

> We might also add a part [c] with explicit commands to back-fill or
> alter your incomplete view of the ODB

Agreed, this will be a nice thing to add.

[...]
>> At its core, my idea was to use the object store to hold markers for the
>> 'not yet fetched' objects (mainly trees and blobs). These would be in a
>> known fixed format, and have the same effect (conceptually) as the
>> sub-module markers - they _confirm_ the oid, yet say 'not here, try
>> elsewhere'.
>
> We do have something like this.  Jonathan can explain better than I, but
> basically, we denote possibly incomplete packfiles from partial clones
> and fetches as "promisor" and have special rules in the code to assert
> that a missing blob referenced from a "promisor" packfile is OK and can
> be fetched later if necessary from the "promising" remote.
>
> The main problem with markers or other lists of missing objects is
> that it has scale problems for large repos.

Any chance that we can get a design doc in Documentation/technical/
giving an overview of the design, with a brief "alternatives
considered" section describing this kind of thing?

E.g. some of the earlier descriptions like
 
https://public-inbox.org/git/20170915134343.3814d...@twelve2.svl.corp.google.com/
 https://public-inbox.org/git/cover.1506714999.git.jonathanta...@google.com/
 https://public-inbox.org/git/20170113155253.1644-1-benpe...@microsoft.com/
may help as a starting point.

Thanks,
Jonathan


Re: How hard would it be to implement sparse fetching/pulling?

2017-12-01 Thread Jonathan Nieder
Hi,

Jeff Hostetler wrote:
> On 11/30/2017 3:03 PM, Jonathan Nieder wrote:

>> One piece of missing functionality that looks interesting to me: that
>> series batches fetches of the missing blobs involved in a "git
>> checkout" command:
>>
>>   https://public-inbox.org/git/20171121211528.21891-14-...@jeffhostetler.com/
>>
>> But it doesn't batch fetches of the missing blobs involved in a "git
>> diff <commit> <commit>" command.  That might be a good place to get
>> your hands dirty. :)
>
> Jonathan Tan added code in unpack-trees to bulk fetch missing blobs
> before a checkout.  This is limited to the missing blobs needed for
> the target commit.  We need this to make checkout seamless, but it
> does mean that checkout may need online access.

Just to clarify: other parts of the series already fetch all missing
blobs that are required for a command.  What that bulk-fetch patch
does is to make that more efficient, by using a single fetch request
to grab all the blobs that are needed for checkout, instead of one
fetch per blob.

This doesn't change the online access requirement: online access is
needed if and only if you don't have the required objects already
available locally.  For example, if at clone time you specified a
sparse checkout pattern and you haven't changed that sparse checkout
pattern, then online access is not needed for checkout.

> I've also talked about a pre-fetch capability to bulk fetch missing
> blobs in advance of some operation.  You could speed up the above
> diff command or back-fill all the blobs I might need before going
> offline for a while.

In particular, something like this seems like a very valuable thing to
have when changing the sparse checkout pattern.

Thanks,
Jonathan


Re: How hard would it be to implement sparse fetching/pulling?

2017-12-01 Thread Jeff Hostetler



On 11/30/2017 6:43 PM, Philip Oakley wrote:

From: "Vitaly Arbuzov" 

[...]

comments below..


On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov  wrote:

Hey Jeff,

It's great, I didn't expect that anyone is actively working on this.
I'll check out your branch, meanwhile do you have any design docs that
describe these changes or can you define high level goals that you
want to achieve?

On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler 
wrote:



On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:

[...]




I have, for separate reasons, been _thinking_ about the issue ($dayjob is in
defence, so a similar partition would be useful).

The changes would almost certainly need to be server side (as well as client
side), as it is the server that decides what is sent over the wire in
the pack files, which would need to be a 'narrow' pack file.


Yes, there will need to be both client and server changes.
In the current 3 part patch series, the client sends a "filter_spec"
to the server as part of the fetch-pack/upload-pack protocol.
If the server chooses to honor it, upload-pack passes the filter_spec
to pack-objects to build an "incomplete" packfile omitting various
objects (currently blobs).  Proprietary servers will need similar
changes to support this feature.

Discussing this feature in the context of the defense industry
makes me a little nervous.  (I used to be in that area.)
What we have in the code so far may be a nice start, but
probably doesn't have the assurances that you would need
for actual deployment.  But it's a start




If we had such a feature then all we would need on top is a separate
tool that builds the right "sparse" scope for the workspace based on
paths that developer wants to work on.

In the world where more and more companies are moving towards large
monorepos this improvement would provide a good way of scaling git to
meet this demand.


The 'companies' problem is that it tends to force a client-server, always-on
on-line mentality. I'm also wanting the original DVCS off-line capability to
still be available, with _user_ control, in a generic sense, of what they
have locally available (including files/directories they have not yet looked
at, but expect to have). IIUC Jeff's work is that on-line view, without the
off-line capability.

I'd commented early in the series at [1,2,3].


Yes, this does tend to lead towards an always-online mentality.
However, there are 2 parts:
[a] dynamic object fetching for missing objects, such as during a
random command like diff or blame or merge.  We need this
regardless of usage -- because we can't always predict (or
dry-run) every command the user might run in advance.
[b] batch fetch mode, such as using partial-fetch to match your
sparse-checkout so that you always have the blobs of interest
to you.  And assuming you don't wander outside of this subset
of the tree, you should be able to work offline as usual.
If you can work within the confines of [b], you wouldn't need to
always be online.

We might also add a part [c] with explicit commands to back-fill or
alter your incomplete view of the ODB (as I explained in response
to the "git diff <commit> <commit>" comment later in this thread).



At its core, my idea was to use the object store to hold markers for the
'not yet fetched' objects (mainly trees and blobs). These would be in a 
known fixed format, and have the same effect (conceptually) as the 
sub-module markers - they _confirm_ the oid, yet say 'not here, try 
elsewhere'.


We do have something like this.  Jonathan can explain better than I, but
basically, we denote possibly incomplete packfiles from partial clones
and fetches as "promisor" and have special rules in the code to assert
that a missing blob referenced from a "promisor" packfile is OK and can
be fetched later if necessary from the "promising" remote.

The main problem with markers or other lists of missing objects is
that it has scale problems for large repos.  Suppose I have 100M
blobs in my repo.  If I do a blob:none clone, I'd have 100M missing
blobs that would need tracking.  If I then do a batch fetch of the
blobs needed to do a sparse checkout of HEAD, I'd have to remove
those entries from the tracking data.  Not impossible, but not
speedy either.



The comparison with submodules means there is the same chance of
de-synchronisation with triangular and upstream servers, unless managed.

The server side, as noted, will need to be included as it is the one that
decides the pack file.

Options for server management are:

- "I accept narrow packs?" No; yes

- "I serve narrow packs?" No; yes.

- "Repo completeness checks on reciept": (must be complete) || (allow 
narrow to nothing).


We have new config settings for the server to allow/reject
partial clones.

And we have code in fsck/gc to handle these incomplete packfiles.
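
For illustration, the serving side would be a single knob along the lines
of the following; the exact name is an assumption to be checked against
the patches:

    git config uploadpack.allowfilter true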



For server farms (e.g. Github..) the settings could be global, or by repo.
(note that the completeness requirement and narrow reciept option

Re: How hard would it be to implement sparse fetching/pulling?

2017-12-01 Thread Jeff Hostetler



On 11/30/2017 3:03 PM, Jonathan Nieder wrote:

Hi Vitaly,

Vitaly Arbuzov wrote:


Found some details here: https://github.com/jeffhostetler/git/pull/3

Looking at commits I see that you've done a lot of work already,
including packing, filtering, fetching, cloning etc.
What are some areas that aren't complete yet? Do you need any help
with implementation?


That's a great question!  I've filed https://crbug.com/git/2 to track
this project.  Feel free to star it to get updates there, or to add
updates of your own.


Thanks!



As described at https://crbug.com/git/2#c1, currently there are three
patch series for which review would be very welcome.  Building on top
of them is welcome as well.  Please make sure to coordinate with
jeffh...@microsoft.com and jonathanta...@google.com (e.g. through the
bug tracker or email).

One piece of missing functionality that looks interesting to me: that
series batches fetches of the missing blobs involved in a "git
checkout" command:

  https://public-inbox.org/git/20171121211528.21891-14-...@jeffhostetler.com/

But it doesn't batch fetches of the missing blobs involved in a "git
diff <commit> <commit>" command.  That might be a good place to get
your hands dirty. :)


Jonathan Tan added code in unpack-trees to bulk fetch missing blobs
before a checkout.  This is limited to the missing blobs needed for
the target commit.  We need this to make checkout seamless, but it
does mean that checkout may need online access.

I've also talked about a pre-fetch capability to bulk fetch missing
blobs in advance of some operation.  You could speed up the above
diff command or back-fill all the blobs I might need before going
offline for a while.

You can use the options that were added to rev-list to help with this.
For example:
    git rev-list --objects [--filter=<filter-spec>] --missing=print <commit>
    git rev-list --objects [--filter=<filter-spec>] --missing=print <commit1>..<commit2>
And then pipe that into a "git fetch-pack --stdin".

You might experiment with this.
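
A concrete (untested) sketch of that pipe, assuming --missing=print
prefixes each missing oid with "?" that has to be stripped before the ids
are handed to fetch-pack:

    git rev-list --objects --missing=print HEAD |
        sed -n 's/^?//p' |
        git fetch-pack --stdin URL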




Thanks,
Jonathan



Thanks,
Jeff



Re: How hard would it be to implement sparse fetching/pulling?

2017-12-01 Thread Jeff Hostetler



On 11/30/2017 12:44 PM, Vitaly Arbuzov wrote:

Found some details here: https://github.com/jeffhostetler/git/pull/3

Looking at commits I see that you've done a lot of work already,
including packing, filtering, fetching, cloning etc.
What are some areas that aren't complete yet? Do you need any help
with implementation?



Sure.  Extra hands are always welcome.

Jonathan Tan and I have been working on this together.
Our V5 is on the mailing now.  We have privately exchanged
some commits that I hope to push up as a V6 today or Monday.

As for how to help, I'll have to think about that a bit.
Without knowing your experience level in the code or your
availability, it is hard to pick something specific right
now.

As a first step, build my core/pc5_p3 branch and try using
partial clone/fetch between local repos.  You can look at
the tests we added (t0410, t5317, t5616, t6112) to see sample
setups using a local pair of repos.  Then try actually using
the partial clone repo for actual work (dogfooding) and see
how it falls short of your expectations.

You might try duplicating the above tests to use a
local "git daemon" serving the remote and do partial clones
using localhost URLs rather than file:// URLs.  That would
exercise the transport differently.

The t5616 test has the start of some end-to-end tests that
try combine multiple steps -- such as do a partial clone
with no blobs and then run blame on a file.  You could extend
that with more tests that test odd combinations of commands
and confirm that we can handle missing blobs in different
scenarios.

Since you've expressed an interest in sparse-checkout and
having a complete end-to-end experience, you might also
experiment with adapting the above tests to use the sparse
filter (--filter=sparse:oid=<blob-ish>) instead of blobs:none
or blobs:limit.  See where that takes you and add tests
as you see fit.  The goal being to get tests in place that
match the usage you want to see (even if they fail) and
see what that looks like.
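
For instance, a new t5616-style case might look something like the sketch
below; the config knob, filter spec and paths are assumptions to be checked
against the actual series rather than known-good values:

    test_expect_success 'sparse filter clone omits unneeded blobs' '
        git -C srv.bare config uploadpack.allowfilter 1 &&
        git clone --no-checkout \
            --filter=sparse:oid=master:templates/bar \
            "file://$(pwd)/srv.bare" pc_sparse &&
        git -C pc_sparse rev-list --objects --missing=print HEAD >revs &&
        grep "^?" revs
    '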

I know it is not as glamorous as adding new functionality,
but it would help get you up-to-speed on the code and we
do need additional tests.

Thanks
Jeff


Re: How hard would it be to implement sparse fetching/pulling?

2017-12-01 Thread Jeff Hostetler



On 11/30/2017 12:01 PM, Vitaly Arbuzov wrote:

Hey Jeff,

It's great, I didn't expect that anyone is actively working on this.
I'll check out your branch, meanwhile do you have any design docs that
describe these changes or can you define high level goals that you
want to achieve?



There are no summary docs in a traditional sense.
The patch series does have updated docs which show
the changes to some of the commands and protocols.
I would start there.

Jeff



Re: How hard would it be to implement sparse fetching/pulling?

2017-12-01 Thread Jeff Hostetler



On 11/30/2017 8:51 PM, Vitaly Arbuzov wrote:

I think it would be great if we agree at a high level on the desired user
experience, so let me put a few possible use cases here.

1. Init and fetch into a new repo with a sparse list.
Preconditions: origin blah exists and has a lot of folders inside of
src including "bar".
Actions:
git init foo && cd foo
git config core.sparseAll true # New flag to activate all sparse
operations by default so you don't need to pass options to each
command.
echo "src/bar" > .git/info/sparse-checkout
git remote add origin blah
git pull origin master
Expected results: foo contains src/bar folder and nothing else,
objects that are unrelated to this tree are not fetched.
Notes: This should work the same when fetch/merge/checkout operations are
used in the right order.


With the current patches (parts 1,2,3) we can pass a blob-ish
to the server during a clone that refers to a sparse-checkout
specification.  There's a bit of a chicken-n-egg problem getting
things set up.  So if we assume your team would create a series
of "known enlistments" under version control, then you could
just reference one by <ref>:<path> during your clone.  The
server can look up that blob and just use it.

git clone --filter=sparse:oid=master:templates/bar URL

And then the server will filter-out the unwanted blobs during
the clone.  (The current version only filters blobs; you still
get full commits and trees.  That will be revisited later.)
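
Getting past the chicken-n-egg is then mostly a matter of committing the
specs somewhere conventional, e.g. (names are placeholders only):

    echo "src/bar" >templates/bar
    git add templates/bar
    git commit -m "add sparse spec for the bar enlistment"
    git push origin master

after which clients can refer to master:templates/bar as in the clone above.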

On the client side, the partial clone installs local config
settings into the repo so that subsequent fetches default to
the same filter criteria as used in the clone.


I don't currently have provision to send a full sparse-checkout
specification to the server during a clone or fetch.  That
seemed like too much to try to squeeze into the protocols.
We can revisit this later if there is interest, but it wasn't
critical for the initial phase.




2. Add a file and push changes.
Preconditions: all steps above followed.
touch src/bar/baz.txt && git add -A && git commit -m "added a file"
git push origin master
Expected results: changes are pushed to remote.


I don't believe partial clone and/or partial fetch will cause
any changes for push.




3. Clone a repo with a sparse list as a filter.
Preconditions: same as for #1
Actions:
echo "src/bar" > /tmp/blah-sparse-checkout
git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
the only command that requires a specific option key being passed.
Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
copied into .git/info/sparse-checkout


There are 2 independent concepts here: clone and checkout.
Currently, there isn't any automatic linkage of the partial clone to
the sparse-checkout settings, so you could do something like this:

git clone --no-checkout --filter=sparse:oid=master:templates/bar URL
git cat-file ... templates/bar >.git/info/sparse-checkout
git config core.sparsecheckout true
git checkout ...

I've been focused on the clone/fetch issues and have not looked
into the automation to couple them.
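
For what it's worth, the elided cat-file step above could plausibly be
spelled like this, reusing the ref and path from the earlier example
(a guess rather than the exact incantation):

    git cat-file blob master:templates/bar >.git/info/sparse-checkout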




4. Showing log for sparsely cloned repo.
Preconditions: #3 is followed
Actions:
git log
Expected results: recent changes that affect src/bar tree.


If I understand your meaning, log would only show changes
within the sparse subset of the tree.  This is not on my
radar for partial clone/fetch.  It would be a nice feature
to have, but I think it would be better to think about it
from the point of view of sparse-checkout rather than clone.




5. Showing diff.
Preconditions: #3 is followed
Actions:
git diff HEAD^ HEAD
Expected results: changes from the most recent commit affecting
src/bar folder are shown.
Notes: this can be a tricky operation as filtering must be done to
remove results from unrelated subtrees.


I don't have any plan for this and I don't think it fits within
the scope of clone/fetch.  I think this too would be a sparse-checkout
feature.
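
As with log, a pathspec already restricts the output today; the caveat in a
partial clone is that the blobs behind any touched paths may first have to
be fetched (or the command errors out in an "airplane mode" setup) before
the textual diff can be produced:

git diff HEAD^ HEAD -- src/bar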




*Note that I intentionally didn't mention use cases that are related
to filtering by blob size as I think we should logically consider them
as a separate, although related, feature.


I've grouped blob-size and sparse filter together for the
purposes of clone/fetch since the basic mechanisms (filtering,
transport, and missing object handling) are the same for both.
They do lead to different end-uses, but that is above my level
here.
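
For reference, these show up as different --filter spellings on clone/fetch;
the sparse form is the one used above, and the exact size syntax here is an
assumption on my part:

git clone --filter=blob:none URL        # omit all blobs
git clone --filter=blob:limit=1m URL    # omit blobs larger than roughly 1 MB
git clone --filter=sparse:oid=master:templates/bar URL   # omit blobs outside the spec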




What do you think about these examples above? Is that something that
more-or-less fits into current development? Are there other important
flows that I've missed?


These are all good ideas and it is good to have someone else who
wants to use partial+sparse thinking about it and looking for gaps
as we try to make a complete end-to-end feature.


-Vitaly


Thanks
Jeff



Re: How hard would it be to implement sparse fetching/pulling?

2017-11-30 Thread Vitaly Arbuzov
Makes sense; I think this perfectly aligns with our needs too.
Let me dive deeper into those patches and the previous discussions that
you've kindly shared above, so I better understand the details.

I'm very excited about what you guys have already done; it's a big deal for
the community!


On Thu, Nov 30, 2017 at 6:51 PM, Jonathan Nieder  wrote:
> Hi Vitaly,
>
> Vitaly Arbuzov wrote:
>
>> I think it would be great if we agree at a high level on the desired user
>> experience, so let me put a few possible use cases here.
>
> I think one thing this thread is pointing to is a lack of overview
> documentation about how the 'partial clone' series currently works.
> The basic components are:
>
>  1. extending git protocol to (1) allow fetching only a subset of the
> objects reachable from the commits being fetched and (2) later,
> going back and fetching the objects that were left out.
>
> We've also discussed some other protocol changes, e.g. to allow
> obtaining the sizes of un-fetched objects without fetching the
> objects themselves.
>
>  2. extending git's on-disk format to allow having some objects not be
> present but only be "promised" to be obtainable from a remote
> repository.  When running a command that requires those objects,
> the user can choose to have it either (a) error out ("airplane
> mode") or (b) fetch the required objects.
>
> It is still possible to work fully locally in such a repo, make
> changes, get useful results out of "git fsck", etc.  It is kind of
> similar to the existing "shallow clone" feature, except that there
> is a more straightforward way to obtain objects that are outside
> the "shallow" clone when needed on demand.
>
>  3. improving everyday commands to require fewer objects.  For
> example, if I run "git log -p", then I want to see the history of
> most files but I don't necessarily want to download large binary
> files just to print 'Binary files differ' for them.
>
> And by the same token, we might want to have a mode for commands
> like "git log -p" to default to restricting to a particular
> directory, instead of downloading files outside that directory.
>
> There are some fundamental changes to make in this category ---
> e.g. modifying the index format to not require entries for files
> outside the sparse checkout, to avoid having to download the
> trees for them.
>
> The overall goal is to make git scale better.
>
> The existing patches do (1) and (2), though it is possible to do more
> in those categories. :)  We have plans to work on (3) as well.
>
> These are overall changes that happen at a fairly low level in git.
> They mostly don't require changes command-by-command.
>
> Thanks,
> Jonathan


Re: How hard would it be to implement sparse fetching/pulling?

2017-11-30 Thread Jonathan Nieder
Hi Vitaly,

Vitaly Arbuzov wrote:

> I think it would be great if we agree at a high level on the desired user
> experience, so let me put a few possible use cases here.

I think one thing this thread is pointing to is a lack of overview
documentation about how the 'partial clone' series currently works.
The basic components are:

 1. extending git protocol to (1) allow fetching only a subset of the
objects reachable from the commits being fetched and (2) later,
going back and fetching the objects that were left out.

We've also discussed some other protocol changes, e.g. to allow
obtaining the sizes of un-fetched objects without fetching the
objects themselves.

 2. extending git's on-disk format to allow having some objects not be
present but only be "promised" to be obtainable from a remote
repository.  When running a command that requires those objects,
the user can choose to have it either (a) error out ("airplane
mode") or (b) fetch the required objects.

It is still possible to work fully locally in such a repo, make
changes, get useful results out of "git fsck", etc.  It is kind of
similar to the existing "shallow clone" feature, except that there
is a more straightforward way to obtain objects that are outside
the "shallow" clone when needed on demand.

 3. improving everyday commands to require fewer objects.  For
 example, if I run "git log -p", then I want to see the history of
most files but I don't necessarily want to download large binary
files just to print 'Binary files differ' for them.

And by the same token, we might want to have a mode for commands
like "git log -p" to default to restricting to a particular
directory, instead of downloading files outside that directory.

There are some fundamental changes to make in this category ---
e.g. modifying the index format to not require entries for files
outside the sparse checkout, to avoid having to download the
trees for them.

The overall goal is to make git scale better.

The existing patches do (1) and (2), though it is possible to do more
in those categories. :)  We have plans to work on (3) as well.

These are overall changes that happen at a fairly low level in git.
They mostly don't require changes command-by-command.

Thanks,
Jonathan


Re: How hard would it be to implement sparse fetching/pulling?

2017-11-30 Thread Vitaly Arbuzov
I think it would be great if we agree at a high level on the desired user
experience, so let me put a few possible use cases here.

1. Init and fetch into a new repo with a sparse list.
Preconditions: origin blah exists and has a lot of folders inside of
src including "bar".
Actions:
git init foo && cd foo
git config core.sparseAll true # New flag to activate all sparse
operations by default so you don't need to pass options to each
command.
echo "src/bar" > .git/info/sparse-checkout
git remote add origin blah
git pull origin master
Expected results: foo contains src/bar folder and nothing else,
objects that are unrelated to this tree are not fetched.
Notes: This should work the same when fetch/merge/checkout operations are
used in the right order.

2. Add a file and push changes.
Preconditions: all steps above followed.
touch src/bar/baz.txt && git add -A && git commit -m "added a file"
git push origin master
Expected results: changes are pushed to remote.

3. Clone a repo with a sparse list as a filter.
Preconditions: same as for #1
Actions:
echo "src/bar" > /tmp/blah-sparse-checkout
git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
the only command that would require a specific option key to be passed.
Expected results: same as for #1 plus /tmp/blah-sparse-checkout is
copied into .git/info/sparse-checkout

4. Showing log for sparsely cloned repo.
Preconditions: #3 is followed
Actions:
git log
Expected results: recent changes that affect src/bar tree.

5. Showing diff.
Preconditions: #3 is followed
Actions:
git diff HEAD^ HEAD
Expected results: changes from the most recent commit affecting
src/bar folder are shown.
Notes: this can be a tricky operation as filtering must be done to
remove results from unrelated subtrees.

*Note that I intentionally didn't mention use cases that are related
to filtering by blob size as I think we should logically consider them
as a separate, although related, feature.

What do you think about these examples above? Is that something that
more-or-less fits into current development? Are there other important
flows that I've missed?

-Vitaly

On Thu, Nov 30, 2017 at 5:27 PM, Vitaly Arbuzov  wrote:
> Jonathan, thanks for references, that is super helpful, I will follow
> your suggestions.
>
> Philip, I agree that keeping original DVCS off-line capability is an
> important point. Ideally this feature should work even with remotes
> that are located on the local disk.
> Which part of Jeff's work do you think wouldn't work offline after
> repo initialization is done and sparse fetch is performed? All the
> stuff that I've seen seems to be quite usable without GVFS.
> I'm not sure if we need to store markers/tombstones on the client;
> what problem does it solve?
>
> On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley  wrote:
>> From: "Vitaly Arbuzov" 
>>>
>>> Found some details here: https://github.com/jeffhostetler/git/pull/3
>>>
>>> Looking at commits I see that you've done a lot of work already,
>>> including packing, filtering, fetching, cloning etc.
>>> What are some areas that aren't complete yet? Do you need any help
>>> with implementation?
>>>
>>
>> comments below..
>>
>>>
>>> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov  wrote:

 Hey Jeff,

 It's great, I didn't expect that anyone is actively working on this.
 I'll check out your branch, meanwhile do you have any design docs that
 describe these changes or can you define high level goals that you
 want to achieve?

 On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler 
 wrote:
>
>
>
> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>
>>
>> Hi guys,
>>
>> I'm looking for ways to improve fetch/pull/clone time for large git
>> (mono)repositories with unrelated source trees (that span across
>> multiple services).
>> I've found sparse checkout approach appealing and helpful for most of
>> client-side operations (e.g. status, reset, commit, etc.)
>> The problem is that there is no feature like sparse fetch/pull in git,
>> this means that ALL objects in unrelated trees are always fetched.
>> It may take a lot of time for large repositories and results in some
>> practical scalability limits for git.
>> This forced some large companies like Facebook and Google to move to
>> Mercurial as they were unable to improve client-side experience with
>> git while Microsoft has developed GVFS, which seems to be a step back
>> to CVCS world.
>>
>> I want to get feedback (from more experienced git users than I am)
>> on what it would take to implement sparse fetching/pulling.
>> (Downloading only objects related to the sparse-checkout list)
>> Are there any issues with missing hashes?
>> Are there any fundamental problems why it can't be done?
>> Can we get away with only client-side changes or would it require
>> special features on the server side?
>>
>>
>> I have, for separate reason

Re: How hard would it be to implement sparse fetching/pulling?

2017-11-30 Thread Vitaly Arbuzov
Jonathan, thanks for references, that is super helpful, I will follow
your suggestions.

Philip, I agree that keeping original DVCS off-line capability is an
important point. Ideally this feature should work even with remotes
that are located on the local disk.
Which part of Jeff's work do you think wouldn't work offline after
repo initialization is done and sparse fetch is performed? All the
stuff that I've seen seems to be quite usable without GVFS.
I'm not sure if we need to store markers/tombstones on the client;
what problem does it solve?

On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley  wrote:
> From: "Vitaly Arbuzov" 
>>
>> Found some details here: https://github.com/jeffhostetler/git/pull/3
>>
>> Looking at commits I see that you've done a lot of work already,
>> including packing, filtering, fetching, cloning etc.
>> What are some areas that aren't complete yet? Do you need any help
>> with implementation?
>>
>
> comments below..
>
>>
>> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov  wrote:
>>>
>>> Hey Jeff,
>>>
>>> It's great, I didn't expect that anyone is actively working on this.
>>> I'll check out your branch, meanwhile do you have any design docs that
>>> describe these changes or can you define high level goals that you
>>> want to achieve?
>>>
>>> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler 
>>> wrote:



 On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>
>
> Hi guys,
>
> I'm looking for ways to improve fetch/pull/clone time for large git
> (mono)repositories with unrelated source trees (that span across
> multiple services).
> I've found sparse checkout approach appealing and helpful for most of
> client-side operations (e.g. status, reset, commit, etc.)
> The problem is that there is no feature like sparse fetch/pull in git,
> this means that ALL objects in unrelated trees are always fetched.
> It may take a lot of time for large repositories and results in some
> practical scalability limits for git.
> This forced some large companies like Facebook and Google to move to
> Mercurial as they were unable to improve client-side experience with
> git while Microsoft has developed GVFS, which seems to be a step back
> to CVCS world.
>
> I want to get feedback (from more experienced git users than I am)
> on what it would take to implement sparse fetching/pulling.
> (Downloading only objects related to the sparse-checkout list)
> Are there any issues with missing hashes?
> Are there any fundamental problems why it can't be done?
> Can we get away with only client-side changes or would it require
> special features on the server side?
>
>
> I have, for separate reasons been _thinking_ about the issue ($dayjob is in
> defence, so a similar partition would be useful).
>
> The changes would almost certainly need to be server side (as well as client
> side), as it is the server that decides what is sent over the wire in the
> pack files, which would need to be a 'narrow' pack file.
>
> If we had such a feature then all we would need on top is a separate
> tool that builds the right "sparse" scope for the workspace based on
> paths that developer wants to work on.
>
> In the world where more and more companies are moving towards large
> monorepos this improvement would provide a good way of scaling git to
> meet this demand.
>
>
> The 'companies' problem is that it tends to force a client-server, always-on,
> on-line mentality. I also want the original DVCS off-line capability to
> still be available, with _user_ control, in a generic sense, of what they
> have locally available (including files/directories they have not yet looked
> at, but expect to have). IIUC Jeff's work is that on-line view, without the
> off-line capability.
>
> I'd commented early in the series at [1,2,3].
>
>
> At its core, my idea was to use the object store to hold markers for the
> 'not yet fetched' objects (mainly trees and blobs). These would be in a
> known fixed format, and have the same effect (conceptually) as the
> sub-module markers - they _confirm_ the oid, yet say 'not here, try
> elsewhere'.
>
> The comparison with submodules means there is the same chance of
> de-synchronisation with triangular and upstream servers, unless managed.
>
> The server side, as noted, will need to be included as it is the one that
> decides the pack file.
>
> Options for server management are:
>
> - "I accept narrow packs?" No; yes
>
> - "I serve narrow packs?" No; yes.
>
> - "Repo completeness checks on reciept": (must be complete) || (allow narrow
> to nothing).
>
> For server farms (e.g. Github..) the settings could be global, or by repo.
> (note that the completeness requirement and narrow receipt option are not
> incompatible - the recipient server can reject the pack from a narrow
> subordinate as incomplete - see below)
>
> * Marking of 'missing' objects in the local obj

Re: How hard would it be to implement sparse fetching/pulling?

2017-11-30 Thread Philip Oakley

From: "Vitaly Arbuzov" 

Found some details here: https://github.com/jeffhostetler/git/pull/3

Looking at commits I see that you've done a lot of work already,
including packing, filtering, fetching, cloning etc.
What are some areas that aren't complete yet? Do you need any help
with implementation?



comments below..


On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov  wrote:

Hey Jeff,

It's great; I didn't expect that anyone was actively working on this.
I'll check out your branch, meanwhile do you have any design docs that
describe these changes or can you define high level goals that you
want to achieve?

On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler 
wrote:



On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:


Hi guys,

I'm looking for ways to improve fetch/pull/clone time for large git
(mono)repositories with unrelated source trees (that span across
multiple services).
I've found the sparse checkout approach appealing and helpful for most
client-side operations (e.g. status, reset, commit, etc.)
The problem is that there is no feature like sparse fetch/pull in git,
this means that ALL objects in unrelated trees are always fetched.
It may take a lot of time for large repositories and result in some
practical scalability limits for git.
This forced some large companies like Facebook and Google to move to
Mercurial as they were unable to improve client-side experience with
git while Microsoft has developed GVFS, which seems to be a step back
to CVCS world.

I want to get feedback (from more experienced git users than I am)
on what it would take to implement sparse fetching/pulling.
(Downloading only objects related to the sparse-checkout list)
Are there any issues with missing hashes?
Are there any fundamental problems why it can't be done?
Can we get away with only client-side changes or would it require
special features on the server side?



I have, for separate reasons, been _thinking_ about the issue ($dayjob is in
defence, so a similar partition would be useful).

The changes would almost certainly need to be server side (as well as client
side), as it is the server that decides what is sent over the wire in the 
pack files, which would need to be a 'narrow' pack file.



If we had such a feature then all we would need on top is a separate
tool that builds the right "sparse" scope for the workspace based on
paths that developer wants to work on.

In the world where more and more companies are moving towards large
monorepos this improvement would provide a good way of scaling git to
meet this demand.


The 'companies' problem is that it tends to force a client-server, always-on,
on-line mentality. I also want the original DVCS off-line capability to
still be available, with _user_ control, in a generic sense, of what they
have locally available (including files/directories they have not yet looked
at, but expect to have). IIUC Jeff's work is that on-line view, without the
off-line capability.

I'd commented early in the series at [1,2,3].


At its core, my idea was to use the object store to hold markers for the
'not yet fetched' objects (mainly trees and blobs). These would be in a 
known fixed format, and have the same effect (conceptually) as the 
sub-module markers - they _confirm_ the oid, yet say 'not here, try 
elsewhere'.
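
As a point of comparison, a gitlink entry for a submodule already records an
oid that is not expected to be present in the local object store.  A sketch
(the path and oid below are made up):

git ls-tree HEAD some-submodule

which prints a line of the form:

160000 commit 1234567890abcdef1234567890abcdef12345678  some-submodule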


The comparison with submodules means there is the same chance of
de-synchronisation with triangular and upstream servers, unless managed.

The server side, as noted, will need to be included as it is the one that
decides the pack file.

Options for server management are:

- "I accept narrow packs?" No; yes

- "I serve narrow packs?" No; yes.

- "Repo completeness checks on reciept": (must be complete) || (allow narrow 
to nothing).


For server farms (e.g. Github..) the settings could be global, or by repo.
(note that the completeness requirement and narrow receipt option are not
incompatible - the recipient server can reject the pack from a narrow
subordinate as incomplete - see below)

* Marking of 'missing' objects in the local object store, and on the wire.
The missing objects are replaced by a placeholder object, which uses the
same oid/sha1, but has a short fixed length, with content “GitNarrowObject
<oid>”. The chance that that string would actually have such an oid clash is
the same as for all other object hashes, so it is a *safe* self-referential device.


* The stored object already includes length (and inferred type), so we do
know what it stands in for. Thus the local index (index file) should be able
to be recreated from the object store alone (including the ‘promised /
narrow / missing’ files/directory markers)

* the ‘same’ as sub-modules.
The potential for loss of synchronisation with a golden complete repo is
just the same as for sub-modules. (We expected object/commit X here, but it’s 
not in the store). This could happen with a small user group who have 
locally narrow clones, who interact with their local narrow server for 
‘backup’, and then fail to push f

Re: How hard would it be to implement sparse fetching/pulling?

2017-11-30 Thread Jonathan Nieder
Hi Vitaly,

Vitaly Arbuzov wrote:

> Found some details here: https://github.com/jeffhostetler/git/pull/3
>
> Looking at commits I see that you've done a lot of work already,
> including packing, filtering, fetching, cloning etc.
> What are some areas that aren't complete yet? Do you need any help
> with implementation?

That's a great question!  I've filed https://crbug.com/git/2 to track
this project.  Feel free to star it to get updates there, or to add
updates of your own.

As described at https://crbug.com/git/2#c1, currently there are three
patch series for which review would be very welcome.  Building on top
of them is welcome as well.  Please make sure to coordinate with
jeffh...@microsoft.com and jonathanta...@google.com (e.g. through the
bug tracker or email).

One piece of missing functionality that looks interesting to me: that
series batches fetches of the missing blobs involved in a "git
checkout" command:

 https://public-inbox.org/git/20171121211528.21891-14-...@jeffhostetler.com/

But it doesn't batch fetches of the missing blobs involved in a "git
diff <commit> <commit>" command.  That might be a good place to get
your hands dirty. :)

Thanks,
Jonathan


Re: How hard would it be to implement sparse fetching/pulling?

2017-11-30 Thread Vitaly Arbuzov
Found some details here: https://github.com/jeffhostetler/git/pull/3

Looking at commits I see that you've done a lot of work already,
including packing, filtering, fetching, cloning etc.
What are some areas that aren't complete yet? Do you need any help
with implementation?


On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov  wrote:
> Hey Jeff,
>
> It's great, I didn't expect that anyone is actively working on this.
> I'll check out your branch, meanwhile do you have any design docs that
> describe these changes or can you define high level goals that you
> want to achieve?
>
> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler  
> wrote:
>>
>>
>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>>
>>> Hi guys,
>>>
>>> I'm looking for ways to improve fetch/pull/clone time for large git
>>> (mono)repositories with unrelated source trees (that span across
>>> multiple services).
>>> I've found sparse checkout approach appealing and helpful for most of
>>> client-side operations (e.g. status, reset, commit, etc.)
>>> The problem is that there is no feature like sparse fetch/pull in git,
>>> this means that ALL objects in unrelated trees are always fetched.
>>> It may take a lot of time for large repositories and results in some
>>> practical scalability limits for git.
>>> This forced some large companies like Facebook and Google to move to
>>> Mercurial as they were unable to improve client-side experience with
>>> git while Microsoft has developed GVFS, which seems to be a step back
>>> to CVCS world.
>>>
>>> I want to get feedback (from more experienced git users than I am)
>>> on what it would take to implement sparse fetching/pulling.
>>> (Downloading only objects related to the sparse-checkout list)
>>> Are there any issues with missing hashes?
>>> Are there any fundamental problems why it can't be done?
>>> Can we get away with only client-side changes or would it require
>>> special features on the server side?
>>>
>>> If we had such a feature then all we would need on top is a separate
>>> tool that builds the right "sparse" scope for the workspace based on
>>> paths that developer wants to work on.
>>>
>>> In the world where more and more companies are moving towards large
>>> monorepos this improvement would provide a good way of scaling git to
>>> meet this demand.
>>>
>>> PS. Please don't advise splitting things up, as there are some good
>>> reasons why many companies decide to keep their code in the monorepo,
>>> which you can easily find online. So let's keep that part out of
>>> scope.
>>>
>>> -Vitaly
>>>
>>
>>
>> This work is in-progress now.  A short summary can be found in [1]
>> of the current parts 1, 2, and 3.
>>
>>> * jh/object-filtering (2017-11-22) 6 commits
>>> * jh/fsck-promisors (2017-11-22) 10 commits
>>> * jh/partial-clone (2017-11-22) 14 commits
>>
>>
>> [1]
>> https://public-inbox.org/git/xmqq1skh6fyz@gitster.mtv.corp.google.com/T/
>>
>> I have a branch that contains V5 all 3 parts:
>> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>>
>> This is a WIP, so there are some rough edges
>> I hope to have a V6 out before the weekend with some
>> bug fixes and cleanup.
>>
>> Please give it a try and see if it fits your needs.
>> Currently, there are filter methods to filter all blobs,
>> all large blobs, and one to match a sparse-checkout
>> specification.
>>
>> Let me know if you have any questions or problems.
>>
>> Thanks,
>> Jeff


Re: How hard would it be to implement sparse fetching/pulling?

2017-11-30 Thread Vitaly Arbuzov
Hey Jeff,

It's great; I didn't expect that anyone was actively working on this.
I'll check out your branch, meanwhile do you have any design docs that
describe these changes or can you define high level goals that you
want to achieve?

On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler  wrote:
>
>
> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>
>> Hi guys,
>>
>> I'm looking for ways to improve fetch/pull/clone time for large git
>> (mono)repositories with unrelated source trees (that span across
>> multiple services).
>> I've found sparse checkout approach appealing and helpful for most of
>> client-side operations (e.g. status, reset, commit, etc.)
>> The problem is that there is no feature like sparse fetch/pull in git,
>> this means that ALL objects in unrelated trees are always fetched.
>> It may take a lot of time for large repositories and results in some
>> practical scalability limits for git.
>> This forced some large companies like Facebook and Google to move to
>> Mercurial as they were unable to improve client-side experience with
>> git while Microsoft has developed GVFS, which seems to be a step back
>> to CVCS world.
>>
>> I want to get feedback (from more experienced git users than I am)
>> on what it would take to implement sparse fetching/pulling.
>> (Downloading only objects related to the sparse-checkout list)
>> Are there any issues with missing hashes?
>> Are there any fundamental problems why it can't be done?
>> Can we get away with only client-side changes or would it require
>> special features on the server side?
>>
>> If we had such a feature then all we would need on top is a separate
>> tool that builds the right "sparse" scope for the workspace based on
>> paths that developer wants to work on.
>>
>> In the world where more and more companies are moving towards large
>> monorepos this improvement would provide a good way of scaling git to
>> meet this demand.
>>
>> PS. Please don't advise splitting things up, as there are some good
>> reasons why many companies decide to keep their code in the monorepo,
>> which you can easily find online. So let's keep that part out of
>> scope.
>>
>> -Vitaly
>>
>
>
> This work is in-progress now.  A short summary can be found in [1]
> of the current parts 1, 2, and 3.
>
>> * jh/object-filtering (2017-11-22) 6 commits
>> * jh/fsck-promisors (2017-11-22) 10 commits
>> * jh/partial-clone (2017-11-22) 14 commits
>
>
> [1]
> https://public-inbox.org/git/xmqq1skh6fyz@gitster.mtv.corp.google.com/T/
>
> I have a branch that contains V5 all 3 parts:
> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>
> This is a WIP, so there are some rough edges
> I hope to have a V6 out before the weekend with some
> bug fixes and cleanup.
>
> Please give it a try and see if it fits your needs.
> Currently, there are filter methods to filter all blobs,
> all large blobs, and one to match a sparse-checkout
> specification.
>
> Let me know if you have any questions or problems.
>
> Thanks,
> Jeff


Re: How hard would it be to implement sparse fetching/pulling?

2017-11-30 Thread Jeff Hostetler



On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:

Hi guys,

I'm looking for ways to improve fetch/pull/clone time for large git
(mono)repositories with unrelated source trees (that span across
multiple services).
I've found the sparse checkout approach appealing and helpful for most
client-side operations (e.g. status, reset, commit, etc.)
The problem is that there is no feature like sparse fetch/pull in git,
this means that ALL objects in unrelated trees are always fetched.
It may take a lot of time for large repositories and result in some
practical scalability limits for git.
This forced some large companies like Facebook and Google to move to
Mercurial as they were unable to improve client-side experience with
git while Microsoft has developed GVFS, which seems to be a step back
to CVCS world.

I want to get feedback (from more experienced git users than I am)
on what it would take to implement sparse fetching/pulling.
(Downloading only objects related to the sparse-checkout list)
Are there any issues with missing hashes?
Are there any fundamental problems why it can't be done?
Can we get away with only client-side changes or would it require
special features on the server side?

If we had such a feature then all we would need on top is a separate
tool that builds the right "sparse" scope for the workspace based on
paths that developer wants to work on.

In the world where more and more companies are moving towards large
monorepos this improvement would provide a good way of scaling git to
meet this demand.

PS. Please don't advise splitting things up, as there are some good
reasons why many companies decide to keep their code in the monorepo,
which you can easily find online. So let's keep that part out of
scope.

-Vitaly




This work is in-progress now.  A short summary can be found in [1]
of the current parts 1, 2, and 3.


* jh/object-filtering (2017-11-22) 6 commits
* jh/fsck-promisors (2017-11-22) 10 commits
* jh/partial-clone (2017-11-22) 14 commits


[1] https://public-inbox.org/git/xmqq1skh6fyz@gitster.mtv.corp.google.com/T/

I have a branch that contains V5 all 3 parts:
https://github.com/jeffhostetler/git/tree/core/pc5_p3

This is a WIP, so there are some rough edges.
I hope to have a V6 out before the weekend with some
bug fixes and cleanup.

Please give it a try and see if it fits your needs.
Currently, there are filter methods to filter all blobs,
all large blobs, and one to match a sparse-checkout
specification.

Let me know if you have any questions or problems.

Thanks,
Jeff