Jeff King <p...@peff.net> writes:
> One thing I've been toying with is "external alternates"; dumping
> your large objects in some relatively slow data store (e.g., a
> RESTful HTTP service). You could cache and cheaply query a list of
> "sha1 / size / type" for each object from the store, but getting the
> actual objects would be much more expensive. But again, it would
> depend on whether you would actually have such a store directly
> accessible by a ref.
Yeah, that actually has been another thing we were discussing
locally, without coming to something concrete enough to present to
The basic idea is to mark such paths with attributes, and use a
variant of smudge/clean filter that is _not_ a filter (as we do not
want to have the interface to this external helper to be "we feed
the whole big blob to you"). Instead, these smudgex/cleanx things
work on a pathname.
- Your in-tree objects store a blob that records a description of
the large thing. Call such a blob a surrogate. "clone", "fetch"
and "push" all deal only with surrogates so your in-history data
will stay small.
- When checking out, the attributes mechanism kicks in and runs the
"not filter" variant of smudge with the data in the surrogate.
The surrogate records how to get the real thing from where, and
how to validate what you got is correct. A hand-wavy example may
look like this:
get: download http://cdn.example.com/67def20
to tell you to download a single URL with whatever means suitable
for your platform (perhaps curl or wget), and verify the result
by running sha1sum. Or it may involve
get: git-fetch git://git.example.com/images.git/ master
to tell you to "git fetch" from the given git-reachable resource
into some place and grab the object via "git cat-file", possibly
streaming it out. The details do not matter at this point in the
The smudgex helper is responsible for caching previously fetched
large contents, maintaining association between the surrogate
blob and its real data, so that once the real thing is
downloaded, and the contents of the path need to change to
something else (e.g. user checks out a different branch) and
then change to the previous thing again (e.g. user comes back to
the original branch), it does not download it again.
- When checking if the working tree is clean relative to the index,
the smudgex/cleanx helper will be consulted. It will be given
the surrogate data in the index and the path in the working tree.
We may want to allow the helper implementation to give a read-only
hardlink directly into the helper's cache storage, so that it can
consult its database of surrogate-to-real mapping and perform
this verification cheaply by inode comparison, or something.
- When running "git add" on modified large content prepared in the
working tree, the cleanx helper is called to prepare a new surrogate,
and that is what is registered in the index. The helper is also
responsible for storing the new large content away and arranging for
it to be retrievable when others see and use this surrogate.
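To make the division of labor above a bit more concrete, here is a rough Python sketch of what such a helper could look like. Everything in it is an assumption made for illustration: the function names (cleanx, smudgex, is_clean), the "sha1:" line in the surrogate, the cache layout, and the fetch callback are all hypothetical, and nothing like this exists in core-git. Uploading the new content so that others can retrieve it is left out entirely.

```python
import hashlib
import os
import shutil


def sha1_of(path):
    """Return the hex SHA-1 of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def parse_surrogate(text):
    """Split "key: value" lines of a surrogate blob into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def cleanx(path, url, cache_dir):
    """Stash the large file in the cache and return the surrogate text.

    The "sha1:" line is a hypothetical way to record how to validate
    the download; `url` is wherever the content will be published.
    """
    os.makedirs(cache_dir, exist_ok=True)
    digest = sha1_of(path)
    cached = os.path.join(cache_dir, digest)
    if not os.path.exists(cached):
        shutil.copyfile(path, cached)
        os.chmod(cached, 0o444)  # cache entries are read-only
    return "get: download %s\nsha1: %s\n" % (url, digest)


def smudgex(surrogate, path, fetch, cache_dir):
    """Materialize the real content at `path`.

    `fetch(spec, dest)` is a caller-supplied callback and is invoked
    only on a cache miss, so switching branches back and forth does
    not download the same content twice.
    """
    fields = parse_surrogate(surrogate)
    cached = os.path.join(cache_dir, fields["sha1"])
    if not os.path.exists(cached):
        os.makedirs(cache_dir, exist_ok=True)
        tmp = cached + ".tmp"
        fetch(fields["get"], tmp)
        if sha1_of(tmp) != fields["sha1"]:
            os.remove(tmp)
            raise ValueError("checksum mismatch for " + path)
        os.chmod(tmp, 0o444)
        os.replace(tmp, cached)
    if os.path.lexists(path):
        os.remove(path)
    os.link(cached, path)  # read-only hardlink into the cache


def is_clean(surrogate, path, cache_dir):
    """Check the working tree file against the surrogate.

    The cheap case is an inode comparison with the cache entry; if the
    file is not a hardlink into the cache, fall back to hashing it.
    """
    fields = parse_surrogate(surrogate)
    cached = os.path.join(cache_dir, fields["sha1"])
    if not os.path.exists(path):
        return False
    if os.path.exists(cached) and os.path.samefile(cached, path):
        return True
    return sha1_of(path) == fields["sha1"]
```

The hardlink only works when the cache and the working tree live on the same filesystem; a real helper would presumably fall back to copying, in which case is_clean degrades to rehashing the file.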
The initial scope of supporting something like that in core-git
would be to add the necessary infrastructure to arrange for such
smudgex and cleanx helpers to be called when a path is marked as a
surrogate in the attribute system, and to supply a sample helper.