Re: [git-users] worlds slowest git repo- what to do?

2014-05-19 Thread John Fisher

Duy Nguyen, I have 7,700 files in the git repo.  Add is much, much faster 
than commit -m "text".  My most populous git repo has 57K files (it's an 
operating system) and I have no issues with the 57K repo.





Re: [git-users] worlds slowest git repo- what to do?

2014-05-16 Thread Dale R. Worley
From: John Fisher <fishook2...@gmail.com>

> FYI we are archiving compressed Linux disk images for VMs and
> hypervisors.

A core problem is that you've got the worst sort of data for something
like Git.  Your files are huge, and being compressed, any effort to
compress saved files or find duplicate strings between them is totally
wasted.  Your workload is anti-optimized for any source management
system.

Here's something that might work (ugh):  Use Subversion, which I seem
to recall will do delta encoding between versions of a single file
but not *between* files.  Have a directory (or directories) which
contain all the big files.  Whenever you change a big file, delete the
old version and create the new version *under a different name* (so
Subversion doesn't try to delta-encode the new version relative to the
old one).  Now, for your real files, keep a directory tree like
normal, but for each of the big files, use a symbolic link (under the
desired name) that points to the actual file (off in the storage
directory).  (When you replace a big file, use a plain filesystem move,
not "svn mv", so that Subversion doesn't try to connect different
versions of a binary.)  I *think* that will prevent Subversion from
trying to do anything clever with big, low-redundancy binary files.

You could probably write a script that would go through the structure
and groom it into the proper shape to be committed: move any big files
in the real tree into the storage directory, replacing them with links,
delete any files in the storage directory that are no longer linked to,
and so on.  The trick would be generating the name in the storage
directory in a way that is uniquely determined by the file contents
(and possibly the modification date).  You don't want to hash the whole
file; that would be too slow...
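
A rough sketch of what such a groom script might look like (the paths,
the size cutoff and the size+mtime naming scheme below are assumptions,
not anything Subversion requires):

    #!/bin/sh
    # Hypothetical groom script: move every "big" file into a flat storage
    # directory, name it by size and mtime (cheaper than hashing the whole
    # file), and leave a symlink behind under the original name.
    set -eu

    TREE=./working-copy        # the tree that actually gets committed
    STORE=./big-file-store     # flat directory holding the real big files
    LIMIT=+100M                # anything larger than this counts as "big"

    mkdir -p "$STORE"

    find "$TREE" -type f -size "$LIMIT" | while read -r f; do
        key=$(stat -c '%s-%Y' "$f")          # size-mtime, not a content hash
        target=$STORE/$(basename "$f").$key
        if [ -e "$target" ]; then
            rm -- "$f"                       # this exact version is already stored
        else
            mv -- "$f" "$target"             # plain move, not "svn mv"
        fi
        ln -s "$(realpath "$target")" "$f"   # symlink under the original name
    done

    # A second pass could delete storage files that nothing links to any more.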

Dale



Re: [git-users] worlds slowest git repo- what to do?

2014-05-16 Thread Konstantin Khomoutov
On Fri, 16 May 2014 10:43:20 -0400
wor...@alum.mit.edu (Dale R. Worley) wrote:

Sorry for replying to your message, not the OP's.

> > FYI we are archiving compressed Linux disk images for VMs and
> > hypervisors.
>
> A core problem is that you've got the worst sort of data for something
> like Git.  Your files are huge, and being compressed, any effort to
> compress saved files or find duplicate strings between them is totally
> wasted.  Your workload is anti-optimized for any source management
> system.
>
> Here's something that might work (ugh):  Use Subversion, which I seem
> to recall will do delta encoding between versions of a single file
> but not *between* files.
[...]

Mercurial does this as well.  On the other hand, IIRC, after N
revisions it stores something like a full checkpoint to make
reconstructing past revisions faster.

I think the OP is better off using something like rsnapshot [1] or
rdiff-backup [2] for his task, or `rsync -H --no-inc-recursive` +
`cp -alR` and a bit of shell scripting (a rough sketch follows below).
These tools provide file-level (in fact, inode-level) deduplication by
hardlinking unchanged files.  Dirvish and unison come to mind as well
(I'm too lazy to google the links to their sites, sorry).
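
A minimal sketch of the `cp -al` + rsync variant (the directory layout
and paths are assumptions; rsnapshot automates essentially the same
idea):

    #!/bin/sh
    # Hardlink-based snapshots: unchanged files share inodes across
    # snapshots, so a new snapshot only costs the space of what changed.
    set -eu

    SRC=/data/releases           # tree being archived (assumed path)
    DST=/backup/releases         # where dated snapshots live (assumed path)
    today=$DST/$(date +%F)
    prev=$(ls -d "$DST"/20* 2>/dev/null | tail -n 1)

    if [ -n "$prev" ] && [ "$prev" != "$today" ]; then
        cp -al "$prev" "$today"  # link-copy the previous snapshot (cheap)
    else
        mkdir -p "$today"
    fi

    # rsync recreates changed files instead of rewriting them in place,
    # so the hardlink is broken only for files that actually changed;
    # -H preserves hardlinks that exist inside the source tree itself.
    rsync -aH --delete --no-inc-recursive "$SRC"/ "$today"/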

Another approach is to use a backup tool which performs block-level
deduplication.  For this, I can name obnam [3] and ZFS (snapshotting
with block-level dedup turned on).
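
For the ZFS route, the moving parts would be roughly these (the pool
and filesystem names are made up, and note that the dedup table has to
fit in RAM to stay fast):

    # one-time setup on an existing pool called "tank" (name assumed)
    zfs create tank/images
    zfs set dedup=on tank/images          # block-level deduplication
    zfs set compression=off tank/images   # the images are already compressed

    # each archived "version" is just a snapshot
    zfs snapshot tank/images@2014-05-16
    zfs list -t snapshot                  # enumerate stored versions
    zfs rollback tank/images@2014-05-16   # or mount a snapshot read-only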

Also, not sure if this has been mentioned by other folks, but there
exist bup [4] and boar [5], which build on the paradigms of a VCS but
are tailored to the needs of working with big binary files.  This [6]
is particularly insightful.  (A quick bup sketch follows the links
below.)

1. http://www.rsnapshot.org/
2. http://www.nongnu.org/rdiff-backup/
3. http://obnam.org/
4. https://github.com/bup/bup
5. https://code.google.com/p/boar
6. https://github.com/bup/bup/blob/master/DESIGN
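
The basic bup flow looks roughly like this (the repository location,
branch name and paths are examples only):

    export BUP_DIR=/srv/backup/images.bup  # bup keeps its data in a git-format repo
    bup init                               # create the repository
    bup index /data/releases               # stat-scan the tree (like git's index)
    bup save -n releases /data/releases    # chunk, deduplicate and store a snapshot
    bup ls releases/latest                 # browse what got saved
    bup restore -C /tmp/out releases/latest/data/releases   # pull a tree back out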



Re: [git-users] worlds slowest git repo- what to do?

2014-05-16 Thread Konstantin Khomoutov
On Fri, 16 May 2014 19:53:35 +0400
Konstantin Khomoutov flatw...@users.sourceforge.net wrote:

[...]
> Another approach is to use a backup tool which performs block-level
> deduplication.  For this, I can name obnam [3] and ZFS (snapshotting
> with block-level dedup turned on).

Attic [1] does block-level dedup as well.

1. https://attic-backup.org/
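
For reference, the Attic workflow is roughly as follows (the repository
path and archive name are made up):

    attic init /backup/images.attic                   # create the dedup repository
    attic create /backup/images.attic::2014-05-16 /data/releases
    attic list /backup/images.attic                   # enumerate stored archives
    attic extract /backup/images.attic::2014-05-16    # restore into the current dir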



Re: [git-users] worlds slowest git repo- what to do?

2014-05-16 Thread John Fisher

On 05/16/2014 03:13 AM, Duy Nguyen wrote:
> On Fri, May 16, 2014 at 2:06 AM, Philip Oakley <philipoak...@iee.org> wrote:
>> From: John Fisher <fishook2...@gmail.com>
>>> I assert, based on one piece of evidence (a post from a Facebook dev),
>>> that I now have the world's biggest and slowest git repository, and I
>>> am not a happy guy.  I used to have the world's biggest CVS repository,
>>> but CVS can't handle multi-GB files.  So I moved the repo to git,
>>> because we are using that for our new projects.
>>>
>>> goal:
>>> keep 150 GB of files (mostly binary), from tiny to over 8 GB, in a
>>> version-control system.
> I think your best bet so far is git-annex

Good, I am looking at that.

> (or maybe bup) for dealing
> with huge files.  I plan on resurrecting Junio's split-blob series to
> make core git handle huge files better, but there's no ETA on that.
> The problem here is about file size, not the number of files, or
> history depth, right?

When things here calm down, I could easily test the repo without the
giant files, leaving 99% of the files in the repo.  There is hardly any
history depth, because these are releases, version-controlled by
directory name.  As has been suggested, I could be forced to abandon
version control, even to the point of just using rsync.  But I've been
doing this with CVS for 10 years now, and I hate to change or in any way
move away from KISS.  Moving it to Git may not have been one of my
better ideas...


> Probably known issues.  But some elaboration would be nice (e.g. what
> operation is slow, how slow, some more detailed characteristics of the
> repo...) in case new problems pop up.

So far I have done add, commit, status, and clone: commit and status are
slow; add seems to depend on the files involved; clone seems to run at
network speed.  I can provide metrics later, see above.  Email me
offline with what you want.
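
For the record, something along these lines would capture the numbers
asked for above (the repository path and test file name are just
placeholders):

    cd /path/to/the/slow/repo
    git count-objects -v              # object counts and pack sizes
    du -sh .git                       # on-disk size of the repository itself
    time git status                   # wall-clock time of the slow operations
    time git add some-huge-image.img  # placeholder file name
    time git commit -m "timing test"
    GIT_TRACE=1 git status 2>status-trace.log   # log which steps take the time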

John



[git-users] worlds slowest git repo- what to do?

2014-05-15 Thread John Fisher
I assert, based on one piece of evidence (a post from a Facebook dev),
that I now have the world's biggest and slowest git repository, and I am
not a happy guy.  I used to have the world's biggest CVS repository, but
CVS can't handle multi-GB files.  So I moved the repo to git, because we
are using that for our new projects.

goal:
keep 150 GB of files (mostly binary), from tiny to over 8 GB, in a
version-control system.

problem:
git is absurdly slow, think hours, on fast hardware.

question:
any suggestions beyond these-
http://git-annex.branchable.com/
https://github.com/jedbrown/git-fat
https://github.com/schacon/git-media
http://code.google.com/p/boar/
subversion 

?


Thanks.



Re: [git-users] worlds slowest git repo- what to do?

2014-05-15 Thread Magnus Therning
On Thu, May 15, 2014 at 10:22:14AM -0700, John Fisher wrote:
> I assert, based on one piece of evidence (a post from a Facebook
> dev), that I now have the world's biggest and slowest git repository,
> and I am not a happy guy.  I used to have the world's biggest CVS
> repository, but CVS can't handle multi-GB files.  So I moved the repo
> to git, because we are using that for our new projects.
>
> goal:
> keep 150 GB of files (mostly binary), from tiny to over 8 GB, in a
> version-control system.
>
> problem:
> git is absurdly slow, think hours, on fast hardware.
>
> question:
> any suggestions beyond these-
> http://git-annex.branchable.com/
> https://github.com/jedbrown/git-fat
> https://github.com/schacon/git-media
> http://code.google.com/p/boar/
> subversion
> ?

I think the general consensus is that git is for version control of
source, i.e. text.  In general, putting large binary files into a DVCS
is a bad idea, since every clone will contain ALL versions of ALL
files.  That makes for a lot of used space!

Maybe a backup tool is what you actually need: https://github.com/bup/bup
Then take another look at git-annex and how it can be used as a client
to bup (a rough sketch below).
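
A rough sketch of that combination (the remote name, host and paths are
invented; see the git-annex documentation on bup special remotes for the
details):

    # inside an ordinary git repository
    git annex init
    git annex add images/*.img        # large files become annexed symlinks

    # hypothetical bup-backed special remote
    git annex initremote mybup type=bup encryption=none \
        buprepo=backuphost:/srv/bup/images.bup
    git annex copy --to mybup         # push the big content into bup
    git annex drop images/old.img     # free local space once it is stored remotely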

/M

-- 
Magnus Therning  OpenPGP: 0xAB4DFBA4 
email: mag...@therning.org   jabber: mag...@therning.org
twitter: magthe   http://therning.org/magnus

Code as if whoever maintains your program is a violent psychopath who knows
where you live.
 -- Anonymous




Re: [git-users] worlds slowest git repo- what to do?

2014-05-15 Thread John Fisher

Thanks Philip, Magnus, Sam.  There's no question that I have an outlier
problem, but others must have similar ones, for instance with master
video files.

I need both remote archiving/retrieval and version control.  FYI, we are
archiving compressed Linux disk images for VMs and hypervisors.  We are
hardware and software makers, and manufacturing blasts the disk images
directly onto SSD drives.  The rest of the repo is a varied mix of far
smaller binaries and text files.

Running on a fast desktop, it can take thousands of seconds to perform a
status or commit, completely pegging one or more processors.

On 05/15/2014 12:06 PM, Philip Oakley wrote:
> From: John Fisher <fishook2...@gmail.com>
>> I assert, based on one piece of evidence (a post from a Facebook dev),
>> that I now have the world's biggest and slowest git repository, ...
>
> At the moment some of the developers are looking to speed up some of
> the code on very large repos, though I think they are looking at code
> repos rather than large-file repos.  They were looking for large repos
> to test some of the code upon ;-)
>
> I've copied the Git list should they want to make any suggestions.
> --
> Philip

