In the transition from git-0.04 to git-0.5 (Linus' track) the naming
convention for files in the object database changed: a file's name
went from the sha1 of its contents as stored to the sha1 of its
contents *before* compression.
This change was preceded by a long discussion, and both conventions
have arguments for and against them.
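
To make the difference between the two conventions concrete, here is a
minimal Python sketch. It assumes that objects are stored
zlib-compressed and it ignores any header git puts in front of the
data; the function names are made up for this illustration and are not
part of git.

```python
import hashlib
import zlib


def name_from_stored_file(contents: bytes) -> str:
    """Older convention: name = sha1 of the file as stored (after compression)."""
    stored = zlib.compress(contents)
    return hashlib.sha1(stored).hexdigest()


def name_from_contents(contents: bytes) -> str:
    """Newer convention: name = sha1 of the contents *before* compression."""
    return hashlib.sha1(contents).hexdigest()


def object_path(name: str) -> str:
    """Split a 40-digit hex name into the xy/z{38} layout."""
    return name[:2] + "/" + name[2:]
```

With the first function the name can change whenever the compressor or
its settings change; with the second it depends only on the
uncompressed contents.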
I would like to suggest adopting a more versatile solution:
 - preserve the pure sha1-based names for the sha1 sum of the file's
   contents. I mean,
     (*) for files with name xy/z{38} their sha1 sum is xyz{38}
 - allow other files (or links) with names of the form
     xy/z{38}.EXTENSION
   where for every EXTENSION the file's contents would be the EXTENSION
   representation of the file xy/z{38}. For every representation type
   EXTENSION there should be procedures to derive the file xy/z{38}
   from the file xy/z{38}.EXTENSION and vice versa (assuming that the
   representation type EXTENSION cares about the contents of file
   xy/z{38}).
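
As a rough illustration of what enforcing axiom (*) could look like,
here is a Python sketch that walks an object database, checks every
plain file xy/z{38} against its name, and collects the EXTENSIONs found
for each name. The function name check_axiom and the assumption that
an EXTENSION is a run of letters, digits or underscores are mine, not
something git defines.

```python
import hashlib
import re
from pathlib import Path

# Directory names xy and file names z{38}, optionally followed by .EXTENSION
_DIR_RE = re.compile(r"^[0-9a-f]{2}$")
_FILE_RE = re.compile(r"^([0-9a-f]{38})(?:\.(\w+))?$")


def check_axiom(db_root):
    """Return (violations, representations) for an object database.

    violations: names xyz{38} whose plain file xy/z{38} has a different sha1
    representations: maps a name xyz{38} to the EXTENSIONs found for it
    """
    violations = []
    representations = {}
    for subdir in sorted(Path(db_root).iterdir()):
        if not (subdir.is_dir() and _DIR_RE.match(subdir.name)):
            continue
        for entry in sorted(subdir.iterdir()):
            m = _FILE_RE.match(entry.name)
            if m is None or not entry.is_file():
                continue
            name = subdir.name + m.group(1)
            extension = m.group(2)
            if extension is None:
                # Axiom (*): the file's own sha1 must equal its name.
                if hashlib.sha1(entry.read_bytes()).hexdigest() != name:
                    violations.append(name)
            else:
                representations.setdefault(name, []).append(extension)
    return violations, representations
```

Files with an EXTENSION are only listed here, not verified; verifying
them needs the conversion procedure of the representation type in
question.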
Let me give two examples:
 - all the files in the object database of git-0.04 are just fine: they
   satisfy axiom (*)
 - the name of every file xy/z{38} in the git-0.5 database should be
   changed to xy/z{38}.g, assuming that we use EXTENSION g as the
   git representation type. The conversion algorithms would be:
     cat-file `cat-file -t xyz{38}` xyz{38}
   to obtain the contents represented by xy/z{38}.g, whose sha1 is
   xyz{38}; and, in the other direction, a utility program has to be
   written to check whether a given file F is valid contents as far as
   git is concerned and, if it is, to compute its sha1 sum xyz{38} and
   also the file xy/z{38}.g.
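
The pair of conversion procedures for a representation type could be
sketched as follows in Python. This is only an illustration under a
strong simplifying assumption: EXTENSION g is treated as plain zlib
compression of the contents, which leaves out whatever header and
validity checks git really applies. The names CODECS,
to_representation and from_representation are hypothetical.

```python
import hashlib
import zlib

# EXTENSION -> (encode, decode). Assumption: "g" is plain zlib compression.
CODECS = {
    "g": (zlib.compress, zlib.decompress),
}


def to_representation(contents: bytes, extension: str):
    """Return (relative path xy/z{38}.EXTENSION, bytes to store there)."""
    encode, _ = CODECS[extension]
    name = hashlib.sha1(contents).hexdigest()
    return f"{name[:2]}/{name[2:]}.{extension}", encode(contents)


def from_representation(path: str, stored: bytes) -> bytes:
    """Recover the contents of xy/z{38} and check them against the name."""
    stem, extension = path.rsplit(".", 1)
    _, decode = CODECS[extension]
    contents = decode(stored)
    expected = stem.replace("/", "")
    if hashlib.sha1(contents).hexdigest() != expected:
        raise ValueError(f"{path}: contents do not hash to {expected}")
    return contents
```

A new representation type would then amount to one more entry in the
table, without touching the plain xy/z{38} names.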
So, what are the advantages of this further complication? I see these:
 - git is separated from the idea of sha1-addressable contents, which
   is indeed an idea larger than git itself. The same or similar
   addressing schemes can (and most probably will) be applied to other
   contents besides SCMs. An example would be a digital library of
   scientific papers in pdf together with its OAI-compliant metadata
   (don't worry if you are not familiar with these terms; it is just
   an example, and I am sure you can come up with many other
   examples where a sha1-addressable database would be interesting)
 - all these uses could share common backup schemes where axiom (*)
   would be enforced. One could think of a shared p2p database of
   repositories of sha1-addressed contents of all kinds. This might be
   important because, in general, the contents of xyz{38} cannot be
   reconstructed from its name. The way to defend against file system
   corruption is replication. Why not share these backup databases?
 - it would be easier to experiment with other compression schemes or
   other proposals for metadata in git itself.
 - it would be easier to experiment with the factorization of common
   chunks of contents, an idea very close to the secret of rsync's
   amazing efficiency.
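
To give an idea of what factoring out common chunks might mean, here is
a deliberately naive Python sketch. It cuts the contents into
fixed-size chunks and stores each chunk once under its sha1; this is
much simpler than rsync's rolling checksum and is not how rsync works.
The chunk size, the in-memory dict used as the store, and the function
names are all assumptions made for this example.

```python
import hashlib

CHUNK_SIZE = 4096  # arbitrary value chosen for this sketch


def store_chunked(contents: bytes, store: dict, chunk_size: int = CHUNK_SIZE):
    """Store contents as sha1-named chunks; return the list of chunk names."""
    manifest = []
    for offset in range(0, len(contents), chunk_size):
        chunk = contents[offset:offset + chunk_size]
        name = hashlib.sha1(chunk).hexdigest()
        store.setdefault(name, chunk)  # a chunk already present is not stored again
        manifest.append(name)
    return manifest


def load_chunked(manifest, store: dict) -> bytes:
    """Reassemble the contents from the list of chunk names."""
    return b"".join(store[name] for name in manifest)
```

With fixed-size chunks an insertion near the start of a file shifts
every later chunk boundary, so a serious experiment would probably want
boundaries that depend on the contents instead.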
Well, that's the proposal. I would be happy to hear comments!
Cheers,
Imre Simon
PS: as it stands, the git-0.5 README file is inconsistent. The naming
change is not reflected in the README file, which in many places states
that the sha1 sum of file xy/z{38} is xyz{38}.