To better handle repos with many files that any individual developer
doesn't need, it would be nice if clone/fetch only brought down those
files that were actually needed.
To enable that, we are proposing adding a flag to clone/fetch that will
instruct the server to limit the objects it sends to commits and trees
and to not send any blobs.
When git performs an operation that requires a blob that isn't currently
available locally, it will download the missing blob and add it to the
local object store.
Clone and fetch will pass a --lazy-clone flag (open to a better name
here) similar to --depth that instructs the server to only return
commits and trees and to ignore blobs.
Later during git operations like checkout, when a blob cannot be found
after checking all the regular places (loose, pack, alternates, etc),
git will download the missing object and place it into the local object
store (currently as a loose object) then resume the operation.
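The fallback described above can be sketched as follows. This is a toy
model, not the prototype's code: LazyObjectStore, fetch_remote and the
downloads counter are hypothetical names, and a dict stands in for the
loose/pack/alternates lookups.

```python
from typing import Callable, Dict


class LazyObjectStore:
    """Toy object store that downloads a missing blob on first use."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        # Stand-in for every local lookup (loose, pack, alternates, ...).
        self.local: Dict[str, bytes] = {}
        # Hypothetical callback that retrieves one blob from the server.
        self.fetch_remote = fetch_remote
        self.downloads = 0

    def read_blob(self, oid: str) -> bytes:
        data = self.local.get(oid)
        if data is None:
            # Not found locally: download it, store it, then resume.
            data = self.fetch_remote(oid)
            self.downloads += 1
            self.local[oid] = data
        return data
```

A second read of the same blob is served from the local store, so each
blob is downloaded at most once.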
To prevent git from accidentally downloading all missing blobs, some git
operations are updated to be aware of the potential for missing blobs.
The most obvious is check_connected, which will return success as if
everything in the requested commits is available locally.
To minimize the impact on the server, the existing dumb HTTP protocol
endpoint objects/<sha> can be used to retrieve the individual missing
blobs when needed.
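For reference, the dumb HTTP protocol serves a loose object at
objects/<first two hex digits>/<remaining 38 hex digits>, and the body
is the zlib-compressed bytes "blob <size>\0<content>" whose SHA-1 is the
object name. A minimal sketch of building that path and validating a
downloaded blob is below; the function names are illustrative only and
are not part of git.

```python
import hashlib
import zlib
from typing import Tuple


def loose_object_path(oid: str) -> str:
    """Relative path of a loose object under the repository URL."""
    return f"objects/{oid[:2]}/{oid[2:]}"


def encode_loose_blob(content: bytes) -> Tuple[str, bytes]:
    """Return (object name, compressed bytes) for a blob's content."""
    raw = b"blob %d\0" % len(content) + content
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


def decode_loose_blob(oid: str, compressed: bytes) -> bytes:
    """Inflate a downloaded loose object and verify it before use."""
    raw = zlib.decompress(compressed)
    if hashlib.sha1(raw).hexdigest() != oid:
        raise ValueError("object name does not match downloaded data")
    header, _, body = raw.partition(b"\0")
    kind, size = header.split(b" ")
    if kind != b"blob" or int(size) != len(body):
        raise ValueError("not a well-formed blob")
    return body
```

Verifying the hash before writing the object keeps a corrupt or
mismatched download out of the local object store.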
We found that downloading commits and trees on demand had a significant
negative performance impact. In addition, many git commands assume all
commits and trees are available locally so they quickly got pulled down
anyway. Even in very large repos the commits and trees are relatively
small so bringing them down with the initial clone and subsequent fetch
commands was reasonable.
After cloning, the developer can use sparse-checkout to limit the set of
files to the subset they need (typically only 1-10% in these large
repos). This allows the initial checkout to only download the set of
files actually needed to complete their task. At any point, the
sparse-checkout file can be updated to include additional files which
will be fetched transparently on demand.
Typical source files are relatively small so the overhead of connecting
and authenticating to the server for a single file at a time is
substantial. As a result, having a long running process that is started
with the first request and can cache connection information between
requests is a significant performance win.
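The benefit comes from paying the connection and authentication cost
once rather than per blob. A minimal sketch of that idea, assuming a
hypothetical connect callable that returns an object with a get(oid)
method (neither is a real git or library API):

```python
from typing import Any, Callable, Optional


class BlobFetcher:
    """Keeps one connection open for the life of the helper process."""

    def __init__(self, connect: Callable[[], Any]) -> None:
        # Hypothetical factory that connects and authenticates.
        self._connect = connect
        self._conn: Optional[Any] = None
        self.connects = 0

    def fetch(self, oid: str) -> bytes:
        if self._conn is None:
            # Only the first request pays for connection setup.
            self._conn = self._connect()
            self.connects += 1
        # Later requests reuse the cached connection.
        return self._conn.get(oid)
```

With this shape, fetching N blobs costs one connection setup plus N
requests instead of N of each.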
Now some numbers
One repo has 3+ million files at tip across 500K folders with 5-6K
active developers. They have done a lot of work to remove large files
from the repo so it is down to < 100GB.
Before changes: clone took hours to transfer the 87GB .pack + 119MB .idx
After changes: clone took 4 minutes to transfer 305MB .pack + 37MB .idx
After hydrating 35K files (the typical number any individual developer
needs to do their work), there was an additional 460 MB of loose files
Total savings: 86.24 GB * 6000 developers = 517 Terabytes saved!
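The savings figure can be reproduced from the numbers above. This
sketch assumes the per-developer figure counts only the .pack files and
the hydrated loose files (the .idx files are left out, which appears to
be how 86.24 GB arises) and that 1 TB = 1000 GB.

```python
# Sizes in GB, taken from the figures quoted above.
before_pack_gb = 87.0
after_pack_gb = 0.305
hydrated_loose_gb = 0.460
developers = 6000

# Assumption: .idx sizes are ignored and 1 TB = 1000 GB.
saved_per_dev_gb = before_pack_gb - after_pack_gb - hydrated_loose_gb
total_saved_tb = saved_per_dev_gb * developers / 1000

print(f"about {saved_per_dev_gb:.1f} GB saved per developer")
print(f"about {total_saved_tb:.0f} TB saved in total")
```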
We have another repo (3.1 M files, 618 GB at tip with no history with
3K+ active developers) where the savings are even greater.
The current prototype calls a new hook proc in sha1_object_info_extended
and read_object to download each missing blob. A better solution would
be to implement this via a long running process that is spawned on the
first download and listens for requests to download additional objects
until it terminates when the parent git operation exits (similar to the
recent long running smudge and clean filter work).
Need to do more investigation into possible code paths that can trigger
unnecessary blob downloads. For example, we have determined that the
rename detection logic in status can download blobs it does not need,
making status slow.
Need to investigate an alternate batching scheme where we can make a
single request for a set of "related" blobs and receive a single
packfile (especially during checkout).
Need to investigate adding a new endpoint in the smart protocol that can
download both individual blobs as well as a batch of blobs.