On Tue, Jan 22, 2019 at 06:07:14PM -0400, Joey Hess wrote:
> joey@darkstar:~/tmp/t> ls -l big-file
> -rw-r--r-- 1 joey joey 11811160064 Jan 22 17:48 big-file
> joey@darkstar:~/tmp/t> git status
> fatal: Out of memory, realloc failed
>
> This file is checked into git, but using a smudge/clean filter, so the actual
> data checked into git is a hash. I did so using git-annex v7 mode, but I
> suppose git lfs would cause the same problem.
>
> [pid 6573] mmap(NULL, 11811164160, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
>
> Why status needs to mmap a large file that is not modified
> and that is configured to pass through smudge/clean, I don't know.
> It seems like it should be possible for status to work in this situation.

One minor point: I don't think this is us mmap-ing the file. The
descriptor is -1, and Git never uses PROT_WRITE. This is likely your
libc using mmap to fulfill a malloc() request.
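Just to illustrate (this is a toy program, not anything from your trace): a
single big malloc() from glibc typically turns into exactly that kind of
anonymous mapping, with fd -1 and MAP_PRIVATE|MAP_ANONYMOUS, and no file
involved at all:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	size_t size = 11811164160ULL;	/* same size as in the strace output */
	void *p = malloc(size);

	if (!p) {
		perror("malloc");
		return 1;
	}
	/* run under "strace -e trace=mmap ./a.out" to see the anonymous mapping */
	free(p);
	return 0;
}
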
That said, it just turns the question into: why did Git try to malloc
that many bytes? If I reproduce your example (using a 100MB file) and
set GIT_ALLOC_LIMIT in the environment, the backtrace to die() is:
#1 0x0000555555786d65 in memory_limit_check (size=104857601, gentle=0) at wrapper.c:27
#2 0x0000555555787084 in xrealloc (ptr=0x0, size=104857601) at wrapper.c:137
#3 0x000055555575612e in strbuf_grow (sb=0x7fffffffdbf0, extra=104857600) at strbuf.c:98
#4 0x000055555575731a in strbuf_read (sb=0x7fffffffdbf0, fd=4, hint=104857600) at strbuf.c:429
#5 0x0000555555664a1f in apply_single_file_filter (path=0x5555558c787c "foo.rand", ...)
#6 0x0000555555665321 in apply_filter (path=0x5555558c787c "foo.rand", ...)
...

Looking at apply_single_file_filter(), it's not the _original_ file that
it's trying to store, but rather the data coming back from the filter.
It's just that we use the original file size as a hint!
In this case (and I'd venture to say in most gigantic-file cases) that hint
is much larger than we need, to the point of causing a problem.
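Roughly speaking (this is a simplified paraphrase of strbuf_read(), not the
exact code in strbuf.c), the hint is used to pre-size the buffer before a
single byte has been read, which is why an 11GB hint means an 11GB allocation
up front even though the filter only hands back a short hash:

/* simplified paraphrase of strbuf_read() from strbuf.c */
ssize_t strbuf_read(struct strbuf *sb, int fd, size_t hint)
{
	/* pre-allocate based on the hint, before any data has arrived */
	strbuf_grow(sb, hint ? hint : 8192);
	for (;;) {
		ssize_t got = read(fd, sb->buf + sb->len, sb->alloc - sb->len - 1);
		if (got < 0)
			return -1;
		if (!got)
			break;
		sb->len += got;
		/* keep room for the next chunk; grows as needed from here on */
		strbuf_grow(sb, 8192);
	}
	sb->buf[sb->len] = '\0';
	return sb->len;
}
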
In other words, I think this patch fixes your problem:
diff --git a/convert.c b/convert.c
index 0d89ae7c23..85aebe2ed3 100644
--- a/convert.c
+++ b/convert.c
@@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t len
 
 	if (start_async(&async))
 		return 0; /* error was already reported */
 
-	if (strbuf_read(&nbuf, async.out, len) < 0) {
+	if (strbuf_read(&nbuf, async.out, 0) < 0) {
 		err = error(_("read from external filter '%s' failed"), cmd);
 	}
 	if (close(async.out)) {

though possibly we should actually continue to use the file size as a
hint up to a certain point, which avoids reallocations for more "normal"
filters where the input and output sizes are in the same ballpark.
Just off the top of my head, something like:
/* guess that the filtered output will be the same size as the original */
hint = len;
/* allocate 10% extra in case the clean size is slightly larger */
hint *= 1.1;
/*
* in any case, never go higher than half of core.bigfileThreshold.
* We'd like to avoid allocating more bytes than that, and that still
* gives us room for our strbuf to preemptively double if our guess is
* just a little on the low side.
*/
if (hint > big_file_threshold / 2)
hint = big_file_threshold / 2;
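
The same guess could also be written with integer-only arithmetic, if doing
floating-point math on a size_t bugs anybody (just a sketch of the same idea,
ignoring overflow for enormous len):

	hint = len + len / 10;			/* original size plus ~10% slack */
	if (hint > big_file_threshold / 2)
		hint = big_file_threshold / 2;	/* never pre-allocate more than this */
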
But to be honest, I have no idea if that would even produce measurable
benefits over simply growing the strbuf from scratch (i.e., hint==0).
-Peff