Sergey,

Last week, we had a 32-bit build of tar 1.22.90 (with some irrelevant
patches) bump into the 3 GB process address space limit and fail during
an incremental run (somehow a preceding full backup run succeeded).
A temporary downgrade to 1.20 let the incremental run succeed, with
"only" around 300 MB of memory used.

I tracked the problem down to this change:

2009-08-07  Sergey Poznyakoff  <[email protected]>
[...]
        * src/misc.c: Include canonicalize.h
        (zap_slashes, normalize_filename): New functions.

It turns out that canonicalize_filename_mode(), which is used by the
"new functions" mentioned above, allocates memory for at least PATH_MAX
bytes (4096 bytes on the system mentioned above), regardless of actual
length of the pathname.  The attached patch corrects or works around
this in a simple manner (optimized for small patch size).  With the
patch, memory usage for a similar incremental run of tar is reduced to
around 400 MB - still higher than 1.20's, but a lot lower than 1.22.90's
without the patch.

Besides fixing canonicalize_filename_mode() or its uses, maybe "struct
directory" could be made more compact again?  For example, maybe "name"
could be allocated along with the struct like it was before instead of
being referenced through a pointer?  Maybe "caname" could be allocated
along with the struct too?  The pointer field would have to remain, but
at least we'd save on malloc overhead (we will have fewer allocations).
(Changing "caname" into a 32-bit offset relative to the allocation start
would save another 4 bytes on 64-bit systems, but it adds complexity.
So I am merely suggesting re-pointing the pointer at this stage.)

Thanks,

Alexander

P.S. For those reading the list archives, our current tar patches are
found at http://cvsweb.openwall.com/cgi/cvsweb.cgi/Owl/packages/tar/
This patch corrects highly non-optimal memory allocation by
canonicalize_filename_mode(), which got exposed with:

2009-08-07  Sergey Poznyakoff  <[email protected]>
[...]
        * src/misc.c: Include canonicalize.h
        (zap_slashes, normalize_filename): New functions.

On a specific test case (a tree with around 3500 sub-directories with a
total of around 58,000 files), this reduces tar's memory usage from
32 MB to 4.5 MB for the initial full backup run ("-g" enabled), and from
19 MB to 5.5 MB for a subsequent incremental run.

On a real-world system with around 370,000 directories and 2.3 million
files where the problem was first spotted, an incremental run of a
32-bit build of tar 1.22.90 bumped into the 3 GB process address space
limit and failed, whereas a build with this patch applied uses around
400 MB during incremental runs and around 300 MB during initial full
backup runs.

--- tar-1.22.90/gnu/canonicalize.c.orig 2009-08-07 11:55:47 +0000
+++ tar-1.22.90/gnu/canonicalize.c      2009-12-08 15:50:40 +0000
@@ -161,6 +161,7 @@ canonicalize_filename_mode (const char *
   char const *end;
   char const *rname_limit;
   size_t extra_len = 0;
+  size_t actual_size;
   Hash_table *ht = NULL;
 
   if (name == NULL)
@@ -325,6 +326,10 @@ canonicalize_filename_mode (const char *
     --dest;
   *dest = '\0';
 
+  actual_size = strlen(rname) + 1;
+  if (rname_limit - rname > actual_size)
+    rname = xrealloc (rname, actual_size);
+
   free (extra_buf);
   if (ht)
     hash_free (ht);

Reply via email to