Jeff King <> writes:

> [1] One thing I've been toying is with "external alternates"; dumping
>     your large objects in some relatively slow data store (e.g., a
>     RESTful HTTP service). You could cache and cheaply query a list of
>     "sha1 / size / type" for each object from the store, but getting the
>     actual objects would be much more expensive. But again, it would
>     depend on whether you would actually have such a store directly
>     accessible by a ref.

Yeah, that actually has been another thing we were discussing
locally, without coming to something concrete enough to present to
the list.

The basic idea is to mark such paths with attributes, and to use a
variant of the smudge/clean filter that is _not_ a filter (we do
not want the interface to this external helper to be "we feed the
whole big blob to you").  Instead, these smudgex/cleanx helpers
work on a pathname.
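For concreteness, marking such paths could look like the following
.gitattributes fragment; the attribute name "surrogate" is made up
here purely for illustration:

```
# .gitattributes -- "surrogate" is a hypothetical attribute name
*.iso    surrogate
media/*  surrogate
```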

 - Your in-tree objects store a blob that records a description of
   the large thing, instead of the large thing itself.  Call such a
   blob a surrogate.  "clone", "fetch" and "push" all deal only
   with surrogates, so your in-history data stays small.

 - When checking out, the attributes mechanism kicks in and runs the
   "not filter" variant of smudge with the data in the surrogate.

   The surrogate records where and how to get the real thing, and
   how to validate that what you got is correct.  A hand-wavy
   example may look like this:

        get: download
        sha1sum: f84667def209e4a84e37e8488a08e9eca3f208c1

   to tell you to download a single URL with whatever means suitable
   for your platform (perhaps curl or wget), and verify the result
   by running sha1sum.  Or it may involve

        get: git-fetch git:// master
        object: 85a094f22f02c54c740448f6716da608a5e89a80

   to tell you to "git fetch" from the given git-reachable resource
   into some place and grab the object via "git cat-file", possibly
   streaming it out.  The details do not matter at this point in the
   design process.
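To make the first example a bit more concrete, a smudgex helper
might parse such a surrogate and verify what it downloaded along
these lines.  This is only a rough sketch: the "key: value"
format, the extra "url" field, and the function names are all
assumptions, not a proposal.

```python
import hashlib

def parse_surrogate(text):
    # Parse a hypothetical "key: value" surrogate blob into a dict.
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return fields

def verify(data, expected_sha1):
    # Validate the downloaded bytes against the recorded sha1sum.
    return hashlib.sha1(data).hexdigest() == expected_sha1

# Example surrogate; the "url" field is an assumption layered on top
# of the hand-wavy format sketched above.
surrogate = parse_surrogate(
    "get: download\n"
    "url: http://example.com/big.iso\n"
    "sha1sum: aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d\n"
)
# A real helper would now fetch surrogate["url"] with curl or wget,
# and refuse the result unless verify(downloaded_bytes,
# surrogate["sha1sum"]) holds.
```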

   The smudgex helper is responsible for caching previously fetched
   large contents and maintaining the association between each
   surrogate blob and its real data.  That way, once the real thing
   has been downloaded, changing the contents of the path to
   something else (e.g. the user checks out a different branch) and
   then back again (e.g. the user returns to the original branch)
   does not download it a second time.
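The caching described above might amount to nothing more than a
content-addressed store keyed by the surrogate blob itself; here is
a sketch under that assumption (the class and method names are
invented):

```python
import hashlib
import os

class SurrogateCache:
    # Map each surrogate blob to its real data on disk, so that
    # checking out a branch, switching away, and coming back does
    # not trigger a second download.
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, surrogate_text):
        key = hashlib.sha1(surrogate_text.encode()).hexdigest()
        return os.path.join(self.root, key)

    def lookup(self, surrogate_text):
        # Return the cached file for this surrogate, or None.
        path = self._path(surrogate_text)
        return path if os.path.exists(path) else None

    def store(self, surrogate_text, data):
        path = self._path(surrogate_text)
        with open(path, "wb") as f:
            f.write(data)
        return path
```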

 - When checking if the working tree is clean relative to the index,
   the smudgex/cleanx helper will be consulted.  It will be given
   the surrogate data in the index and the path in the working tree.
   We may want to allow the helper implementation to give out a
   read-only hardlink directly into the helper's cache storage, so
   that it can consult its database of surrogate-to-real mappings
   and perform this verification cheaply by inode comparison, or
   something like that.
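The inode trick could look like this sketch: at checkout time the
helper hands out a hardlink into its cache, and the cleanliness
check then reduces to a stat comparison.  The function names are
hypothetical, and this assumes a filesystem with hardlinks and a
cache the user does not write to.

```python
import os

def checkout_from_cache(cache_path, worktree_path):
    # Hand out a hardlink into the helper's cache instead of
    # copying the large content into the working tree.
    if os.path.exists(worktree_path):
        os.unlink(worktree_path)
    os.link(cache_path, worktree_path)

def looks_clean(cache_path, worktree_path):
    # If both names still point at the same inode on the same
    # device, the working-tree file cannot have diverged from the
    # (read-only) cached content.
    a, b = os.stat(cache_path), os.stat(worktree_path)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```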

 - When "git add" is run on modified large content prepared in the
   working tree, the cleanx helper is called to prepare a new
   surrogate, and that is what gets registered in the index.  The
   helper is also responsible for storing the new large content
   away and arranging for it to be retrievable by others who see
   and use this surrogate.
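A cleanx helper along those lines might boil down to something like
the following.  Again only a sketch: the surrogate format matches
the hand-wavy example above, and the upload step is left as a
comment.

```python
import hashlib

def cleanx(worktree_path):
    # Produce the surrogate text that gets registered in the index
    # in place of the large content itself.
    with open(worktree_path, "rb") as f:
        data = f.read()
    # A real helper would also store `data` away here, so that
    # anyone who later sees this surrogate can retrieve the content.
    return "get: download\nsha1sum: %s\n" % hashlib.sha1(data).hexdigest()
```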

The initial scope of supporting something like that in core-git
would be to add the necessary infrastructure to arrange for such
smudgex and cleanx helpers to be called when a path is marked as a
surrogate in the attribute system, and to supply a sample helper.