Re: [PATCH v2] partial-clone: design doc

2017-12-14 Thread Jeff Hostetler



On 12/14/2017 1:24 PM, Junio C Hamano wrote:
> Jeff Hostetler writes:
>
>> +- On the client these incomplete packfiles are marked as "promisor pacfiles"
>
> s/pacfiles/packfiles/
>
>> +  These "promisor packfiles" consist of a ".promisor" file with
>> +  arbitrary contents (like the ".keep" files), in addition to
>> +  their ".pack" and ".idx" files.
>> +
>> +  In the future, this ability may be extended to loose objects in case
>> +  a promisor packfile is accidentally unpacked.
>
> Hmph.
>
> Because we cannot assume that such an "accidental" unpacking would
> do anything extra to help us tell the loose objects created out of a
> promisor pack from other loose objects, you would end up making any
> and all loose objects to serve as if they came from a promisor
> remote?  I am not sure if that makes much sense.
>
> Do we really need to write this "in the future" down, before we have
> thought things through enough to specify the design at a bit more
> detailed level?



good point.  i'll move this to the bottom and elaborate on the
problem rather than the solution.

Jeff


Re: [PATCH v2] partial-clone: design doc

2017-12-14 Thread Junio C Hamano
Jeff Hostetler  writes:

> +  There are various filters available to accomodate different situations.

s/accomodate/accommodate/

I'll squash in this and /pacfile/packfile/ typofix while queuing.

Thanks.


Re: [PATCH v2] partial-clone: design doc

2017-12-14 Thread Junio C Hamano
Jeff Hostetler  writes:

> +- On the client these incomplete packfiles are marked as "promisor pacfiles"

s/pacfiles/packfiles/

> +  These "promisor packfiles" consist of a ".promisor" file with
> +  arbitrary contents (like the ".keep" files), in addition to
> +  their ".pack" and ".idx" files.
> +
> +  In the future, this ability may be extended to loose objects in case
> +  a promisor packfile is accidentally unpacked.

Hmph.

Because we cannot assume that such an "accidental" unpacking would
do anything extra to help us tell the loose objects created out of a
promisor pack from other loose objects, you would end up making any
and all loose objects to serve as if they came from a promisor
remote?  I am not sure if that makes much sense.

Do we really need to write this "in the future" down, before we have
thought things through enough to specify the design at a bit more
detailed level?


[PATCH v2] partial-clone: design doc

2017-12-14 Thread Jeff Hostetler
From: Jeff Hostetler 

First draft of design document for partial clone feature.

Signed-off-by: Jeff Hostetler 
---
 Documentation/technical/partial-clone.txt | 259 ++
 1 file changed, 259 insertions(+)
 create mode 100644 Documentation/technical/partial-clone.txt

diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
new file mode 100644
index 000..731bd8c
--- /dev/null
+++ b/Documentation/technical/partial-clone.txt
@@ -0,0 +1,259 @@
+Partial Clone Design Notes
+==========================
+
+The "Partial Clone" feature is a performance optimization for git that
+allows git to function without having a complete copy of the repository.
+
+During clone and fetch operations, git normally downloads the complete
+contents and history of the repository.  That is, during clone the client
+receives all of the commits, trees, and blobs in the repository into a
+local ODB.  Subsequent fetches extend the local ODB with any new objects.
+For large repositories, this can take significant time to download and
+large amounts of disk space to store.
+
+The goal of this work is to allow git to better handle extremely large
+repositories.  Often in these repositories there are many files that the
+user does not need, such as ancient versions of source files, files in
+portions of the worktree outside of the user's work area, or large binary
+assets.  If we can avoid downloading such unneeded objects *in advance*
+during clone and fetch operations, we can decrease download times and
+reduce ODB disk usage.
+
+
+Non-Goals
+---------
+
+Partial clone is a mechanism to limit the number of blobs and trees downloaded
+*within* a given range of commits -- and is therefore independent of and not
+intended to conflict with existing DAG-level mechanisms to limit the set of
+requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').
+
+
+Design Overview
+---------------
+
+Partial clone logically consists of the following parts:
+
+- A mechanism for the client to describe unneeded or unwanted objects to
+  the server.
+
+- A mechanism for the server to omit such unwanted objects from packfiles
+  sent to the client.
+
+- A mechanism for the client to gracefully handle missing objects (that
+  were previously omitted by the server).
+
+- A mechanism for the client to backfill missing objects as needed.
+
+
+Design Details
+--------------
+
+- A new pack-protocol capability "filter" is added to the fetch-pack and
+  upload-pack negotiation.
+
+  This uses the existing capability discovery mechanism.
+  See "filter" in Documentation/technical/pack-protocol.txt.
+
+- Clients pass a "filter-spec" to clone and fetch which is passed to the
+  server to request filtering during packfile construction.
+
+  There are various filters available to accomodate different situations.
+  See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.
+
+- On the server pack-objects applies the requested filter-spec as it
+  creates "filtered" packfiles for the client.
+
+  These filtered packfiles are incomplete in the traditional sense because
+  they may contain trees that reference blobs that the client does not have.
+
+- On the client these incomplete packfiles are marked as "promisor pacfiles"
+  and treated differently by various commands.
+
+- On the client a repository extension is added to the local config to
+  prevent older versions of git from failing mid-operation because of
+  missing objects that they cannot handle.
+  See "extensions.partialClone" in Documentation/technical/repository-version.txt.
+
+
+Handling Missing Objects
+------------------------
+
+- An object may be missing due to a partial clone or fetch, or missing due
+  to repository corruption.  To differentiate these cases, the local
+  repository specially indicates packfiles obtained from the promisor
+  remote.
+
+  These "promisor packfiles" consist of a ".promisor" file with
+  arbitrary contents (like the ".keep" files), in addition to
+  their ".pack" and ".idx" files.
+
+  In the future, this ability may be extended to loose objects in case
+  a promisor packfile is accidentally unpacked.
+
+- The local repository considers a "promisor object" to be an object that
+  it knows (to the best of its ability) that the promisor remote has, either
+  because the local repository has that object in one of its promisor
+  packfiles, or because another promisor object refers to it.
+
+  When Git encounters a missing object, Git can see if it is a promisor object
+  and handle it appropriately.  If not, Git can report a corruption.
+
+  This means that there is no need for the client to explicitly maintain an
+  expensive-to-modify list of missing objects.
+
+- Since almost all Git code currently expects any referenced object to be
+  present locally and because we do not want to force every command to do
+  a dry-run first, a fallback mechanism is added