Re: [git-users] Considerations regarding unsynced local git repo for 100s of GiBs of data?

Roke Beedell (RokeJulianLockhart) Wed, 27 Sep 2023 05:00:42 -0700

Thanks for that, Konstantin. You've made some good points, especially about 
storage duplication. I wasn't aware that it would be quite that high. I don't 
consider this an X/Y problem – I'm investigating why certain methods of file 
management aren't used, and you've provided darn sensible reasons for this not 
to be one.

However, you've also clarified that due to the overhead, git is indeed a 
surprisingly competent file history management solution when compared to 
certain others, since BTRFS snapshots are at best regular compressed backups of 
the entire filesystem rather than history of every modification to the 
repository (in this case the filesystem).

I'll definitely have to go with another solution like Nextcloud instead, since 
I've not the storage, so I'm very thankful for the response.

________________________________
From: git-users@googlegroups.com <git-users@googlegroups.com> on behalf of 
Konstantin Khomoutov <kos...@bswap.ru>
Sent: Wednesday, September 27, 2023 11:44:48 AM
To: git-users@googlegroups.com <git-users@googlegroups.com>
Subject: Re: [git-users] Considerations regarding unsynced local git repo for 
100s of GiBs of data?

On Tue, Sep 26, 2023 at 04:55:20PM -0700, Roke Beedell (RokeJulianLockhart) 
wrote:

> I use cpe:/o:opensuse:tumbleweed:20230922 with a BTRFS data drive of
> approximately 300 GiBs. I've rather wanted some form of file history, and
> have absolutely loved using git for software development because of that.
>
> However, I know that it can struggle with large amounts of data, yet know
> that Microsoft now uses it to manage the Windows codebase, although I don't
> know whether they've structured this in the form of a monorepo.
>
> Consequently, can I use a local, never to be externally synced via git, git
> repository for file history? To be able to see commits with my git software
> would be incredible.
[...]

To me, the question appears to be a bit haphazard so I think maybe some
clarifications are in order. For instance, does it really matter your files
are located on a BTRFS volume? Does this mean you'd like the volume to be
somehow _automatically_ connected to a Git repo? Or do you mean the repo in
question is to be located on a BTRFS volume (as well as the "source" files),
and you had performance issues in mind when you were providing that information?

Well, simply put, I would advise against the idea.

Maybe I should first note that your idea is not original and old-timers like
me, who lurk on various discussion venues dedicated to various version control
systems hear such questions/proposals on a regular basis. Please note that I'm
saying that not to dismiss you but to hint at the "been there done that"-style
of reasoning which will follow. ;-)

The arguments against mostly fall into two broad classes: nonsensical
commits and ineffective storage.

Let's first consider nonsensical commits. The chief reason for version control
systems to exist is to properly record a series of changes to a set of files
comprising a coherent "project". Both points are crucial. To record "properly"
means that each commit must represent a logically atomic change which is
properly documented by the commit message. Versioning a coherent set of files
means that the files in the set must be somehow related to each other - be it
the source code of a software project, the text of an essay or a book, a set
of configuration files for some software suite etc.
I honestly fail to see how you intend to sensibly commit changes done to 300
GB worth of files. Of course, this might mean these files are, say, assets for
a computer game, some digital drawings or files handled by CAD software.
If these files are more randomly picked - like the photos you take

Another subpoint here is that if your files are not plain text, you won't get
meaningful diffs (display of "what was changed") for your commits.

Now let's consider ineffective storage. All popular version control systems
are designed to work on small text files. The simple reason for this is that
this is what "tech people" have dealt with since moving from punched cards and
tapes to digitally stored files. Even today's monstrous development environments
(IDEs) with all those UIs made to make you write as little text as possible,
in the end still produce those small text files. Storing large binary objects
in a typical VCS is still possible but it's going to waste disk space and be
ineffective performance-wise.

Also note that unless certain tricks are employed, any popular VCS will
actually occupy locally _at least_ twice as much disk space as the "source"
files. That's simply by design: any DVCS has something which Git calls "the
work tree" - these are files on a filesystem which are "watched" by the VCS,
which you modify and then ask the VCS to record any changes done in them, -
and "the store", the actual repository which contains the data comprising all
the recorded history and all the data necessary to reconstruct any of the past
revisions.
Simply put, this means for your 300 GB store you'd need more than twice as
much storage space (in fact, 4-5 times as much or more would be a more
realistic estimation).
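
To check how the "work tree" versus "store" split looks for an existing
repository, something like the following could be used. This is only a minimal
sketch: the helper name split_usage is made up for illustration, it assumes the
store is a plain ".git" directory directly under the repository root, and it
sums apparent file sizes rather than the disk blocks actually allocated.

```python
import os


def split_usage(repo_root):
    """Return (work_tree_bytes, store_bytes) for the repository at repo_root.

    Assumption: the store is a regular ".git" directory under repo_root
    (not a worktree or submodule whose ".git" is a pointer file).
    """
    work_tree = 0
    store = 0
    git_dir = os.path.join(repo_root, ".git")
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        # Everything at or below ".git" counts as the store.
        in_store = dirpath == git_dir or dirpath.startswith(git_dir + os.sep)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so targets are not counted twice
            size = os.path.getsize(path)
            if in_store:
                store += size
            else:
                work_tree += size
    return work_tree, store
```

Calling split_usage on the top-level directory of a repository gives the two
totals in bytes; comparing them after a few commits of large binary files
shows how quickly the store grows relative to the work tree.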

Some DVCSes have certain tricks to make working with insane amounts of data
less painful but they are no magic and come with their own "strings attached".
For instance, MS has developed a thing called GVFS which can be plugged into
Git (working on Windows, dunno whether that works elsewhere) and turn the work
tree into a virtual filesystem of sorts - fetching the relevant data from a
remote server. Plain Git has "sparse checkouts" (and sparse index) which
allow checking out into the work tree only the specific directories (but not
others).
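
As a rough illustration of the sparse checkout mentioned above, the sketch
below builds (and optionally runs) the relevant commands. It assumes Git 2.25
or newer, where the "git sparse-checkout" command is available; the helper
names and the example directory names are hypothetical. Note that this only
limits what is written to the work tree; it does not shrink the store.

```python
import subprocess


def sparse_checkout_commands(directories):
    """Build the git commands that restrict the work tree to `directories`."""
    if not directories:
        raise ValueError("at least one directory is required")
    return [
        # Enable sparse checkout in "cone" mode (whole directories only).
        ["git", "sparse-checkout", "init", "--cone"],
        # Keep only the listed directories in the work tree.
        ["git", "sparse-checkout", "set", *directories],
    ]


def apply_sparse_checkout(repo_root, directories):
    """Run the commands in repo_root; raises CalledProcessError on failure."""
    for cmd in sparse_checkout_commands(directories):
        subprocess.run(cmd, cwd=repo_root, check=True)
```

For example, apply_sparse_checkout("/path/to/repo", ["documents", "photos/2023"])
would leave only those two directories (plus top-level files) checked out.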

If you squint at the rundown presented above, you might notice that what you
wanted to use Git for looks suspiciously like plain old file backup
software ;-) And that is what I'd recommend to make use of instead.
AFAIK, BTRFS has volume snapshotting out of the box. If you need off-site
backups then things like Borg Backup work like a charm.

All-in-all, you might instead state clearly your original problem, not ask
about your attempted solution. We could then possibly give more
educated advice.

--
You received this message because you are subscribed to a topic in the Google 
Groups "Git for human beings" group.
To unsubscribe from this topic, visit 
https://groups.google.com/d/topic/git-users/d3p_THvxQkg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to 
git-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/git-users/20230927104448.dt2ajo74mgjabp7f%40carbon.

