Re: [git-users] How does Git storing entire files rather than deltas make it superior?

Philip Oakley Sun, 03 Nov 2019 03:34:41 -0800

Hi, a couple of extra comments about the theme.

On 03/11/2019 00:01, Michael wrote:
+1.

On 2019-11-01, at 12:39 PM, likejudo <anil.r...@gmail.com> wrote:
I was wondering if this isn't space inefficient - and how does it become superior to a VCS by storing snapshots rather than deltas?

Initially Git is space inefficient, as the first step is simply to zlib-compress each and every file in the new snapshot. However, if the object ID hash value is identical (nothing changed!) then there is nothing new to store - instant de-duplication. But if it did change, hmm, then we do get a whole new object.
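
As a rough illustration of those first two steps (compress everything, skip anything whose hash is already known), here is a minimal sketch. The `ObjectStore` class and its in-memory dict are hypothetical stand-ins for Git's object directory; the `blob <size>\0` header and the SHA-1 hash follow Git's long-standing default object format.

```python
import hashlib
import zlib


class ObjectStore:
    """Toy content-addressed store (hypothetical; not Git's real code)."""

    def __init__(self):
        self.objects = {}  # object id (hex string) -> zlib-compressed bytes

    def put_blob(self, content: bytes) -> str:
        # Git hashes a short header plus the content, not the bare content.
        data = b"blob " + str(len(content)).encode() + b"\0" + content
        oid = hashlib.sha1(data).hexdigest()
        if oid not in self.objects:  # identical content: nothing new to store
            self.objects[oid] = zlib.compress(data)
        return oid

    def get_blob(self, oid: str) -> bytes:
        data = zlib.decompress(self.objects[oid])
        _header, _sep, content = data.partition(b"\0")
        return content


store = ObjectStore()
a = store.put_blob(b"hello\n")
b = store.put_blob(b"hello\n")   # unchanged file in the next snapshot
c = store.put_blob(b"hello!\n")  # changed file: a whole new object
print(a == b, a != c, len(store.objects))
```

Storing the same content twice costs nothing extra, while a one-character change produces a second full object; that is the cost the pack step addresses later.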

But that Torvalds guy was sneaky, and so (third step) he knew that the new and old files were mostly identical, often with common text even across other files, so he created the pack compression mechanism, which records **similarity** (the old version matches the current version from the start to point A, and from point B to the end; you'd inserted text between A and B, i.e. the diff will be A-B). Hence ... [Michael's good points]
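
The "same up to A, same from B onwards, new text in between" idea can be sketched as a list of copy and insert instructions against a base version. This is only an illustration: `make_delta` and `apply_delta` are hypothetical names, the matching is done with Python's `difflib`, and Git's real pack format uses its own binary encoding and its own way of choosing which object to use as the base.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copies from base plus literal inserted bytes."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # bytes only target has
        # "delete": bytes only in base, so nothing is emitted
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"line one\nline two\nline three\n"
target = b"line one\nline two, edited\nline three\n"
delta = make_delta(base, target)
assert apply_delta(base, delta) == target
```

The full file can always be rebuilt from the base plus the instructions, so the snapshot model seen by the user is unchanged; only the on-disk representation gets smaller.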

Some people will cite studies showing that the pack files have better compression than you'd normally expect; this is to be expected from compressing a larger amount of data.

Some people will cite that "unmodified files checksummed" prevents unexpected alterations; git is actually the first example I know of a block chain in real life, before it was called a block chain. Git gains all the advantages of blockchains for verifying accuracy.

All true.
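
To make the chain point concrete, here is a small sketch in which each commit id covers its parent's id, so changing anything in the past changes every id after it. The payload layout in `commit_id` is a hypothetical simplification; a real Git commit names a tree object and carries author and date fields as well.

```python
import hashlib


def commit_id(parent_id: str, snapshot: bytes, message: str) -> str:
    # Hypothetical simplified payload: parent id, content hash, message.
    content_hash = hashlib.sha1(snapshot).hexdigest()
    payload = "\n".join([parent_id, content_hash, message]).encode()
    return hashlib.sha1(payload).hexdigest()


def build_history(snapshots: list) -> list:
    """Return the chain of commit ids, each one covering its parent."""
    ids, parent = [], ""
    for content, message in snapshots:
        parent = commit_id(parent, content, message)
        ids.append(parent)
    return ids


history = [(b"v1", "first"), (b"v2", "second"), (b"v3", "third")]
good = build_history(history)

# Quietly alter the oldest snapshot and rebuild the chain.
tampered = build_history([(b"v1 with a backdoor", "first")] + history[1:])

# Every id from the altered commit onwards differs, so comparing
# only the newest id is enough to notice the change.
print(good[-1] != tampered[-1])
```

This is why publishing the hash of the "correct" version is enough: anyone holding a copy can check the whole history against that one value.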

What I think is the success of Git, which is implicit in the way Linux development works, is that "Control" [aka managers from hell] has been *distributed* from the management to the user. You no longer need any permission to store anything you want into the holy shrine of the "VCS" (in case you might have somehow contaminated it). The manager is relieved of those horrid tasks of handling coders, and simply lists the valid hash of the "correct" versions. All your niff-naff and trivia are local to you, but are secure, and validated by their own hashes. You can get back to the various interim states you were at without worry.

The critical point here (and this is slightly philosophical) is that there is no longer a single MASTER (see works of art such as the Mona Lisa, or your code..). Code can be perfectly replicated at almost zero cost. Its value is in having the correct copy (the hash), rather than having the only copy. The whole version "Control" paradigm has broken out of the 'fragility' box that bedevilled physical artefact control (paper drawings, serialised parts, VIN numbers on cars).

Having broken the veracity problem, the diff-based approach with a central authority falls away, especially when the pack file technique is included.

There is still the problem of non-diffable files (e.g. audio/video (AV) edits) where it is still an all or nothing problem (especially for packing), but that is an issue common to both approaches. The Microsoft contributors are looking at how they can handle the Windows mono-repo (largest in the world!), and then hopefully others will look at the large mono-file problems (how to diff and merge AV files).

Some people will question what "superior" means.
The bottom line is this: Git was developed for the Linux kernel. Git was developed based on the needs of a decently sized small project.
Yeah, there was a time when I thought Linux was big. "Big" is what you get when Microsoft and Google both start moving their development/version control over to git. There's stuff in git designed to deal with very, very large archives that these two have contributed.
In a nutshell, git has these advantages over everything else that came before it:
1. Ability to work with really large archives.
2. Ability to recover not just a version of a file, but a version of a project, even as filenames change.
3. Ability to check what changes were made in a given subdirectory during a period of time -- used by people working on a subset of the Linux kernel, for example.
4. Ability to merge more than two deltas off a previous base.
5. Ability to ensure no one slipped unauthorized changes into the source code.
6. Ability to have different people work on different files at the same time without ever running into "locking" issues, without having to have a network connection at "checkout" time, without needing to have a concept of checking out.
7. Ability to consider anyone's copy as the "master" copy -- useful if the maintainer/"master" of a project changes.
When you consider these goals, space used by text files isn't nearly as important. Once you get to something the size of the Linux code base, you can start to think that you might be consuming disk space.
====
As stated, the best way to think of git is a read-only filesystem. Files are presented to git in their "only" finished form, and do not get stored in the filesystem until finished. There is no "differential" at the lowest level, only a bunch of full files that do not change.

Personally I don't use the read-only filesystem metaphor as it doesn't quite work for me, but I can see that it is a useful analogy (I'd get hung up on the decision about which files are journaled in Git or not).

Everything else is layered on top of that.

The files are named by their hash code.
There are files that contain mappings of file user-names to hash codes -- which in turn have a hash code name. These are the "directory listings". Some of the entries in those listings are sub-directories instead of user-supplied files.
There are files that contain the hash of the top-level project directory, and information about which version that project directory represents.
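
The three layers just described (files, directory listings, and the record that names the top-level directory) can be sketched as follows. The helper names are hypothetical; the byte layouts follow Git's blob, tree and commit objects in outline, but the sorting rule for directory names is simplified and the author and committer lines of a real commit are left out.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """Name an object by the hash of a short header plus its body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: dict) -> bytes:
    """Directory listing: maps each user-visible name to (mode, object id)."""
    body = b""
    for name in sorted(entries):  # simplified ordering (assumption)
        mode, oid = entries[name]
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
    return body


def commit_body(tree_id: str, parent_ids: list, message: str) -> bytes:
    """Top-level record: which directory listing this version points at."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += ["", message]
    return "\n".join(lines).encode()


readme = object_id("blob", b"read me\n")
main_c = object_id("blob", b"int main(void) { return 0; }\n")
src = object_id("tree", tree_body({"main.c": ("100644", main_c)}))
root = object_id("tree", tree_body({"README": ("100644", readme),
                                    "src": ("40000", src)}))
first = object_id("commit", commit_body(root, [], "initial snapshot"))

# Renaming README leaves the file's own id untouched; only the
# directory listing, and therefore the commit, gets a new name.
root2 = object_id("tree", tree_body({"README.txt": ("100644", readme),
                                     "src": ("40000", src)}))
second = object_id("commit", commit_body(root2, [first], "rename README"))
print(root != root2, first != second)
```

Because the file's name lives in the listing and not in the file object, a rename never requires storing the content again.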
What does this not give you, that has to be calculated all the time? The diff from version N to N+1. You need it, for example, when you want to apply "what changed" between C and D as a rebase onto B.
Diff-based VCS's give you that cheaply, but lose all the other benefits.
Linux found those benefits to be better.
Microsoft and Google are switching.
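
For what "calculated all the time" means in practice, here is a minimal sketch that derives the diff from two stored snapshots on demand. The `diff_snapshots` helper and the dict-of-files snapshot shape are assumptions for illustration; Git's own diff machinery is considerably more elaborate.

```python
import difflib


def diff_snapshots(old: dict, new: dict) -> str:
    """Compute a unified diff between two full snapshots (path -> text)."""
    out = []
    for path in sorted(old.keys() | new.keys()):
        a = old.get(path, "").splitlines(keepends=True)
        b = new.get(path, "").splitlines(keepends=True)
        if a != b:  # unchanged files contribute nothing
            out.extend(difflib.unified_diff(
                a, b, fromfile=f"a/{path}", tofile=f"b/{path}"))
    return "".join(out)


version_n = {"README": "hello\n", "notes.txt": "one\ntwo\n"}
version_n1 = {"README": "hello\n", "notes.txt": "one\n2\n"}
print(diff_snapshots(version_n, version_n1))
```

Nothing about the change was stored when version N+1 was recorded; the diff is recomputed from the two snapshots each time it is asked for.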

Are there issues/problems? Sure.
Are they less of an issue this way than any other way so far? Seems like it.
Are there features people would like to see in Git? Yep.
Could most of them be added to git without changing the "Read only filesystem" at the heart? Yes.
Is there a better system design than git? Sure. Do we know what it is? Probably not.

--
Philip

