Thanks for that, Konstantin. You've made some good points, especially about storage duplication. I wasn't aware that it would be quite that high. I don't consider this an X/Y problem – I'm investigating why certain methods of file management aren't used, and you've provided darn sensible reasons for this not to be one.
However, you've also clarified that due to the overhead, git is indeed a surprisingly competent file history management solution when compared to certain others, since BTRFS snapshots are at best regular compressed backups of the entire filesystem rather than history of every modification to the repository (in this case the filesystem). I'll definitely have to go with another solution like Nextcloud instead, since I've not the storage, so I'm very thankful for the response.

________________________________
From: git-users@googlegroups.com <git-users@googlegroups.com> on behalf of Konstantin Khomoutov <kos...@bswap.ru>
Sent: Wednesday, September 27, 2023 11:44:48 AM
To: git-users@googlegroups.com <git-users@googlegroups.com>
Subject: Re: [git-users] Considerations regarding unsynced local git repo for 100s of GiBs of data?

On Tue, Sep 26, 2023 at 04:55:20PM -0700, Roke Beedell (RokeJulianLockhart) wrote:

> I use cpe:/o:opensuse:tumbleweed:20230922 with a BTRFS data drive of
> approximately 300 GiBs. I've rather wanted some form of file history, and
> have absolutely loved using git for software development because of that.
>
> However, I know that it can struggle with large amounts of data, yet know
> that Microsoft now uses it to manage the Windows codebase, although I don't
> know whether they've structured this in the form of a monorepo.
>
> Consequently, can I use a local, never to be externally synced via git, git
> repository for file history? To be able to see commits with my git software
> would be incredible.
[...]

To me, the question appears to be a bit haphazard so I think maybe some clarifications are in order.

For instance, does it really matter your files are located on a BTRFS volume? Does this mean you'd like the volume to be somehow _automatically_ connected to a Git repo?
Or do you mean the repo in question is to be located on a BTRFS volume (as well as the "source" files), and you had performance issues in mind when you were providing that information?

Well, simply put, I would advise against the idea.

Maybe I should first note that your idea is not original, and old-timers like me, who lurk on various discussion venues dedicated to various version control systems, hear such questions/proposals on a regular basis. Please note that I'm saying that not to dismiss you but to hint at the "been there done that"-style of reasoning which will follow. ;-)

The arguments against mostly fall into two broad classes: nonsensical commits and ineffective storage.

Let's first consider nonsensical commits. The chief reason for version control systems to exist is to properly record a series of changes to a set of files comprising a coherent "project". Both points are crucial. To record "properly" means that each commit must represent a logically atomic change which is properly documented by the commit message. Versioning a coherent set of files means that the files in the set must be somehow related to each other - be it the source code of a software project, the text of an essay or a book, a set of configuration files for some software suite etc.

I honestly fail to see how you intend to sensibly commit changes done to 300 GB worth of files. Of course, this might mean these files are, say, assets for a computer game, some digital drawings or files handled by CAD software. If these files are more randomly picked - like the photos you take

Another subpoint here is that if your files are not plain text, you won't get meaningful diffs (display of "what was changed") for your commits.

Now let's consider ineffective storage. All popular version control systems are designed to work on small text files. The simple reason for this is that this is what "tech people" have dealt with since moving from punched cards and tapes to digitally stored files.
Even today's monstrous development environments (IDEs), with all those UIs made to make you write as little text as possible, in the end still produce those small text files. Storing large binary objects in a typical VCS is still possible but it's going to waste disk space and be ineffective performance-wise.

Also note that unless certain tricks are employed, any popular VCS will actually occupy locally _at least_ twice as much disk space as the "source" files. That's simply by design: any DVCS has something which Git calls "the work tree" - these are files on a filesystem which are "watched" by the VCS, which you modify and then ask the VCS to record any changes done in them, - and "the store", the actual repository which contains the data comprising all the recorded history and all the data necessary to reconstruct any of the past revisions. Simply put, this means for your 300 GB store you'd need more than twice as much storage space (in fact, 4-5 times as much or more would be a more realistic estimation).

Some DVCSes have certain tricks to make working with insane amounts of data less painful but they are no magic and come with their own "strings attached". For instance, MS has developed a thing called GVFS which can be plugged into Git (working on Windows, dunno whether that works elsewhere) and turn the work tree into a virtual filesystem of sorts - fetching the relevant data from a remote server. Plain Git has "sparse checkouts" (and sparse index) which allow checking out into the work tree only the specific directories (but not others).

If you squint at the rundown presented above, you might notice that what you wanted to use Git for looks suspiciously like a plain old file backup software ;-) And that is what I'd recommend to make use of instead. AFAIK, BTRFS has volume snapshotting out of the box. If you need off-site backups then things like Borg Backup work like a charm.
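
As a rough way to see the work tree versus store split described above on a repository you already have, the following Python sketch adds up file sizes in the two places. It assumes the conventional layout where the store is a `.git` directory at the top of the work tree. The names `dir_size` and `repo_footprint` are made up for illustration, and the numbers are apparent file sizes, not blocks actually allocated on disk.

```python
import os


def dir_size(path, skip=()):
    """Sum the sizes of regular files under path.

    Directories directly under path whose names are in skip are ignored.
    Symbolic links are not followed and not counted.
    """
    total = 0
    for root, dirs, files in os.walk(path):
        if root == path:
            # Prune skipped top-level directories in place so os.walk
            # never descends into them.
            dirs[:] = [d for d in dirs if d not in skip]
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # File vanished or is unreadable; leave it out of the sum.
                pass
    return total


def repo_footprint(repo_root):
    """Return (work_tree_bytes, store_bytes) for a repository root.

    Assumption: the store lives in repo_root/.git (a plain, non-bare
    repository). If there is no such directory the store size is 0.
    """
    store = dir_size(os.path.join(repo_root, ".git"))
    work_tree = dir_size(repo_root, skip=(".git",))
    return work_tree, store


if __name__ == "__main__":
    work_tree, store = repo_footprint(".")
    print(f"work tree: {work_tree} bytes, store: {store} bytes")
```

Run from the top of a repository, this prints the two sizes side by side. With large binary files that change often, the store tends to grow with every committed revision, which is where estimates well beyond "twice as much" come from.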

All-in-all, you might instead state clearly your original problem, not ask about your attempted solution. We could then possibly give more educated advice.

--
You received this message because you are subscribed to a topic in the Google Groups "Git for human beings" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/git-users/d3p_THvxQkg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to git-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/git-users/20230927104448.dt2ajo74mgjabp7f%40carbon.