Re: [git-users] worlds slowest git repo- what to do?
Duy Nguyen, I have 7700 files in the git repo. Add is much, much faster than "commit -m text". My most populous git repo has 57K files (it's an operating system) and I have no issues with the 57K repo.
Re: [git-users] worlds slowest git repo- what to do?
> From: John Fisher <fishook2...@gmail.com>
> FYI we are archiving compressed Linux disk images for VMs and hypervisors.

A core problem is that you've got the worst sort of data for something like Git. Your files are huge, and being compressed, any effort to compress saved files or find duplicate strings between them is totally wasted. Your workload is anti-optimized for any source management system.

Here's something that might work (ugh): Use Subversion, which I seem to recall will do delta encoding between versions of a single file but not *between* files. Have a directory (or directories) which contains all the big files. Whenever you change a big file, delete the old version and create the new version *under a different name* (so Subversion doesn't try to delta-encode the new version relative to the old one).

Now, for your real files, keep a directory tree like normal, but for each of the big files, use a symbolic link (under the desired name) that points to the actual file (off in the storage directory). (Not svn mv, but just a filesystem move, so that Subversion doesn't try to connect different versions of a binary.)

I *think* that will prevent Subversion from trying to do anything clever with big, low-redundancy binary files. You could probably write a script that would go through the structure and groom it into the proper shape to be committed: move any big files in the real tree into the storage directory, replacing them with links, deleting any non-linked-to files in the storage directory, etc. The trick would be having a way to generate the name in the storage directory in a way that is uniquely determined by the file contents (and possibly modification date). You don't want to hash the whole file; that would be too slow...

Dale
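(A minimal sketch of the grooming script described above, assuming GNU find/stat, a storage directory named "big-store", and "big" meaning anything over 100 MB; the name is derived from size and mtime rather than a full content hash. All paths and thresholds are illustrative.)

    #!/bin/sh
    # Groom the tree before committing: move big files into big-store/,
    # leave symlinks behind, and drop store files nothing points at any more.
    STORE=big-store
    mkdir -p "$STORE"

    # 1. Move every big file into the store under a size+mtime derived name.
    find . -path "./$STORE" -prune -o -type f -size +100M -print |
    while IFS= read -r f; do
        name=$(basename "$f")-$(stat -c '%s-%Y' "$f")
        mv "$f" "$STORE/$name"
        ln -s "$PWD/$STORE/$name" "$f"
    done

    # 2. Delete store files that no symlink references any more.
    for s in "$STORE"/*; do
        [ -e "$s" ] || continue
        find . -path "./$STORE" -prune -o -type l -lname "*/$(basename "$s")" -print |
            grep -q . || rm -f "$s"
    done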
Re: [git-users] worlds slowest git repo- what to do?
On Fri, 16 May 2014 10:43:20 -0400
wor...@alum.mit.edu (Dale R. Worley) wrote:

Sorry for replying to your message, not the OP's.

> > FYI we are archiving compressed Linux disk images for VMs and hypervisors.
>
> A core problem is that you've got the worst sort of data for something like Git. Your files are huge, and being compressed, any effort to compress saved files or find duplicate strings between them is totally wasted. Your workload is anti-optimized for any source management system.
>
> Here's something that might work (ugh): Use Subversion, which I seem to recall will do delta encoding between versions of a single file but not *between* files.
[...]

Mercurial does this as well. On the other hand, IIRC, after N revisions it does something like a full checkpoint to make reconstructing past revisions faster.

I think the OP is better off using something like rsnapshot [1] or rdiff-backup [2] for his task, or `rsync -H --no-inc-recursive` + `cp -alR` and a bit of shell scripting. These tools provide file-level (in fact, inode-level) deduplication by hardlinking unchanged files. Dirvish and unison come to mind as well (I'm too lazy to google the links to their sites, sorry).

Another approach is to use a backup tool which performs block-level deduplication. For this, I can name obnam [3] and ZFS (snapshotting with block-level dedup turned on).

Also, not sure if this has been mentioned by other folks, but there exist bup [4] and boar [5], which build on the paradigms of VCS but are tailored to the needs of working with big binary files. This [6] is particularly insightful.

1. http://www.rsnapshot.org/
2. http://www.nongnu.org/rdiff-backup/
3. http://obnam.org/
4. https://github.com/bup/bup
5. https://code.google.com/p/boar
6. https://github.com/bup/bup/blob/master/DESIGN
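(A minimal sketch of the hardlink-snapshot pattern described above, done by hand with cp -al and rsync; the snapshot paths, dates, and source directory are illustrative.)

    # Previous and new snapshot directories (example dates).
    prev=/backup/images/2014-05-15
    next=/backup/images/2014-05-16

    # Hardlink every file of the previous snapshot into the new one (cheap),
    # then let rsync replace only the files that actually changed. rsync
    # writes changed files to a new inode, so the old snapshot keeps its copy.
    cp -al "$prev" "$next"
    rsync -aH --delete --no-inc-recursive /srv/releases/ "$next"/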
Re: [git-users] worlds slowest git repo- what to do?
On Fri, 16 May 2014 19:53:35 +0400
Konstantin Khomoutov <flatw...@users.sourceforge.net> wrote:

[...]
> Another approach is to use a backup tool which performs block-level deduplication. For this, I can name obnam [3] and ZFS (snapshotting with block-level dedup turned on).

Attic [1] does block-level dedup as well.

1. https://attic-backup.org/
Re: [git-users] worlds slowest git repo- what to do?
On 05/16/2014 03:13 AM, Duy Nguyen wrote:
> On Fri, May 16, 2014 at 2:06 AM, Philip Oakley <philipoak...@iee.org> wrote:
> > From: John Fisher <fishook2...@gmail.com>
> > > I assert based on one piece of evidence (a post from a Facebook dev) that I now have the world's biggest and slowest git repository, and I am not a happy guy. I used to have the world's biggest CVS repository, but CVS can't handle multi-G sized files. So I moved the repo to git, because we are using that for our new projects.
> > >
> > > goal: keep 150 G of files (mostly binary) from tiny sized to over 8G in a version-control system.
> >
> > I think your best bet so far is git-annex

Good, I am looking at that (or maybe bup) for dealing with huge files.

> I plan on resurrecting Junio's split-blob series to make core git handle huge files better, but there's no ETA on that. The problem here is about file size, not the number of files, or history depth, right?

When things here calm down, I could easily test the repo without the giant files, leaving 99% of the files in the repo. There is hardly any history depth because these are releases, version-controlled by directory name. As has been suggested, I could be forced to abandon the version control, even to the point of just using rsync. But I've been doing this with CVS for 10 years now and I hate to change or in any way move away from KISS. Moving it to Git may not have been one of my better ideas...

> Probably known issues. But some elaboration would be nice (e.g. what operation is slow, how slow, some more detailed characteristics of the repo...) in case new problems pop up.

So far I have done add, commit, status, clone - commit and status are slow; add seems to depend on the files involved; clone seems to run at network speed. I can provide metrics later, see above. Email me offline with what you want.

John
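(A rough way to gather the metrics asked for above, assuming a local checkout; GIT_TRACE and count-objects are standard git facilities, while the repository path and file name are illustrative.)

    cd /path/to/big-repo
    git count-objects -vH              # object count and pack sizes
    time git status                    # wall-clock time of a status
    time git add some-large-file.img   # and of adding one big file
    GIT_TRACE=1 git commit -m "timing test" 2> commit-trace.log
                                       # logs which internal commands run and when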
[git-users] worlds slowest git repo- what to do?
I assert based on one piece of evidence (a post from a Facebook dev) that I now have the world's biggest and slowest git repository, and I am not a happy guy. I used to have the world's biggest CVS repository, but CVS can't handle multi-G sized files. So I moved the repo to git, because we are using that for our new projects.

goal: keep 150 G of files (mostly binary) from tiny sized to over 8G in a version-control system.

problem: git is absurdly slow, think hours, on fast hardware.

question: any suggestions beyond these?

http://git-annex.branchable.com/
https://github.com/jedbrown/git-fat
https://github.com/schacon/git-media
http://code.google.com/p/boar/
subversion

Thanks.
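(For comparing the options listed above: git-annex and git-fat both keep only small pointers in git itself and store the large content outside the object database. A minimal git-annex sketch; the paths, file patterns, and repository description are illustrative, since the actual layout of the repo isn't given.)

    cd /srv/releases
    git init
    git annex init "release archive"   # turn the repo into an annex
    git annex add images/*.img.gz      # big files: content goes into .git/annex,
                                       # git itself only records a small pointer
    git add docs/ scripts/             # ordinary small files stay in git as usual
    git commit -m "import current release tree"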
Re: [git-users] worlds slowest git repo- what to do?
On Thu, May 15, 2014 at 10:22:14AM -0700, John Fisher wrote:
> I assert based on one piece of evidence (a post from a Facebook dev) that I now have the world's biggest and slowest git repository, and I am not a happy guy. I used to have the world's biggest CVS repository, but CVS can't handle multi-G sized files. So I moved the repo to git, because we are using that for our new projects.
>
> goal: keep 150 G of files (mostly binary) from tiny sized to over 8G in a version-control system.
>
> problem: git is absurdly slow, think hours, on fast hardware.
>
> question: any suggestions beyond these?
> http://git-annex.branchable.com/
> https://github.com/jedbrown/git-fat
> https://github.com/schacon/git-media
> http://code.google.com/p/boar/
> subversion
>
> Thanks.

I think the general consensus is that git is for version control of source, i.e. text. In general, putting large binary files into a DVCS is a bad idea, since every clone will contain ALL versions of ALL files. That makes for a lot of used space!

Maybe a backup is what you actually need: https://github.com/bup/bup

Then take another look at git-annex and how it can be used as a client to bup.

/M

--
Magnus Therning                      OpenPGP: 0xAB4DFBA4
email: mag...@therning.org   jabber: mag...@therning.org
twitter: magthe               http://therning.org/magnus

Code as if whoever maintains your program is a violent psychopath who knows where you live.
     -- Anonymous
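(A rough sketch of the bup route mentioned above, plus pointing git-annex at bup as its storage backend; the repository name, host, and initremote parameters are illustrative and may vary by git-annex version.)

    # Plain bup: index the tree, then save a deduplicated snapshot of it.
    bup init
    bup index /srv/releases
    bup save -n releases /srv/releases

    # From inside a git-annex repo: register a bup repository as a special
    # remote and push the annexed content there.
    git annex initremote bupstore type=bup encryption=none buprepo=backuphost:releases.bup
    git annex copy --to bupstore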
Re: [git-users] worlds slowest git repo- what to do?
Thanks Philip, Magnus, Sam.

There's no question that I have an outlier problem. But others must have similar ones, for instance master video files. I need both remote archiving/retrieval and version control.

FYI, we are archiving compressed Linux disk images for VMs and hypervisors. We are hardware-software makers, and manufacturing blasts the disk images directly onto SSD drives. The rest of the repo is a varied mix of far smaller binaries and text files.

Running on a fast desktop, it can take thousands of seconds to perform a status or commit, completely pegging one or more procs.

On 05/15/2014 12:06 PM, Philip Oakley wrote:
> From: John Fisher <fishook2...@gmail.com>
> > I assert based on one piece of evidence (a post from a Facebook dev) that I now have the world's biggest and slowest git repository, ...
>
> At the moment some of the developers are looking to speed up some of the code on very large repos, though I think they are looking at code repos rather than large-file repos. They were looking for large repos to test some of the code upon ;-) I've copied the Git list should they want to make any suggestions.
> --
> Philip
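(Not a cure for the underlying size problem, but a few stock git settings that are sometimes used to stop git from burning CPU on delta/zlib compression of large, already-compressed files; the 50 MB threshold is an illustrative value.)

    # Don't attempt delta compression on anything larger than ~50 MB.
    git config core.bigFileThreshold 50m
    # Skip zlib compression of newly stored objects (the images are already compressed).
    git config core.compression 0
    # Keep automatic repacking from kicking in during commits.
    git config gc.auto 0
    # If untracked build output is large, skip scanning it during status.
    git status -uno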