(Scenario at the top, concrete questions below.)

I'm in the process of migrating an SVN repository to Git, including 
history, using "git svn". The SVN repository currently contains a number of 
large (data) files:
- five binary files of 100 MB to 400 MB, with up to 17 revisions,
- eleven binary files in the range of 10 MB to 50 MB, with up to 10 
revisions,
- the largest non-binary file has a size of 31 MB. 

The large binary files are in the history, so it's not easy for me to 
get rid of them. (Rewriting history to move the largest files out of the 
repository into a dedicated "large file store" could be an option, though, 
if absolutely necessary.)

Git has somewhat of a bad reputation regarding large binary files, so I've 
done some research. I have found this recent thread: 
https://groups.google.com/d/msg/git-users/EIGoSe1eIYc/dL8voHjF4RUJ, but I'm 
not able to derive concrete measures from it yet, so I've decided to ask 
here myself. The clients are using Git for Windows, which is compiled for 
x86 (32 bit process). My research has shown that, apparently, Git on 32 
bits can have problems with out of memory errors when diffing, compressing, 
or packing files that are too large to fit in the limited memory (twice) a 
32 bit process can access.

As I understand it, the diffing problem shouldn't affect me because a) 
binary files don't need to be diffed anyway and b) even loading a 400 MB 
file into memory twice for diffing is not going to be a problem in a 32 bit 
process.

The compression problem can be tackled by setting core.bigFileThreshold to 
something smaller than the default (512 MiB according to the git-config 
documentation) because Git won't attempt delta compression on files larger 
than this. This, however, does have the disadvantage that, for example, the 
17 revisions of a 150 MB file could amount to about 2.5 GB of data even if 
the revisions could be delta-compressed against each other.
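
As a sanity check on that figure, here is the arithmetic as a small shell 
sketch. It assumes every revision is stored in full at roughly 150 MB, which 
is a worst case: it ignores whatever the per-object zlib compression still 
saves.

```shell
# Worst case: every revision of the file is stored in full, with no deltas.
revisions=17
size_mb=150   # assumed size of each revision, in MB

total_mb=$((revisions * size_mb))
echo "${total_mb} MB"   # prints "2550 MB", i.e. about 2.5 GB
```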

The packing problem can be tackled by setting pack.packSizeLimit to 
something small enough to limit the maximum pack size. (The default in Git 
for Windows is 2g.) pack.windowMemory and other pack options seem to play a 
role as well, although I don't understand exactly how. However, these pack 
settings do not affect the size of pack files created for pushing and 
pulling. Therefore, pushing and pulling might remain a problem.

So, that's the result of my research so far. Now for my questions:

1. Have I got everything right in my analysis above? Am I missing anything 
important, any problems I should expect?
2. Would you recommend setting core.bigFileThreshold, pack.packSizeLimit or 
other options to non-default values proactively on all clients, or should I 
rather postpone this until (if ever) we're experiencing problems? If I 
don't set these values proactively, is there a chance that the Git 
repository could be ruined?
-- What is a good value for core.bigFileThreshold, given my concrete binary 
files of 10 to 400 MB, some of which have up to 17 revisions?
-- What is a good value for pack.packSizeLimit? Git for Windows defaults it 
to 2g, is there any reason not to leave it at that?
3. Since pack.packSizeLimit does not affect the packs created for pulling 
and pushing - what problems can I expect there? How could I tackle them?
4. "git repack -afd" and "git gc" currently fail with an out of memory 
error on the migrated repository [1][2]. Should I worry about this?
-- I can make "git repack -afd" work by passing "--window-memory 750m" to 
the command. (After that, git gc works fine again.) Again, is setting 
pack.windowMemory to 750m something I should do proactively?

Thanks, best regards,

[1] $ git repack -afd
Counting objects: 189121, done.
Delta compression using up to 8 threads.
warning: suboptimal pack - out of memory
fatal: Out of memory, malloc failed (tried to allocate 331852630 bytes)

[2] $ git gc
Counting objects: 189121, done.
Delta compression using up to 8 threads.
fatal: Out of memory, malloc failed (tried to allocate 73267908 bytes)
error: failed to run repack
