Sergey, Last week, we had a 32-bit build of tar 1.22.90 (with some irrelevant patches) bump into the 3 GB process address space limit and fail during an incremental run (somehow a preceding full backup run succeeded). A temporary downgrade to 1.20 let the incremental run succeed, with "only" around 300 MB of memory used.
I tracked the problem down to this change: 2009-08-07 Sergey Poznyakoff <[email protected]> [...] * src/misc.c: Include canonicalize.h (zap_slashes, normalize_filename): New functions. It turns out that canonicalize_filename_mode(), which is used by the "new functions" mentioned above, allocates memory for at least PATH_MAX bytes (4096 bytes on the system mentioned above), regardless of actual length of the pathname. The attached patch corrects or works around this in a simple manner (optimized for small patch size). With the patch, memory usage for a similar incremental run of tar is reduced to around 400 MB - still higher than 1.20's, but a lot lower than 1.22.90's without the patch. Besides fixing canonicalize_filename_mode() or its uses, maybe "struct directory" could be made more compact again? For example, maybe "name" could be allocated along with the struct like it was before instead of being referenced through a pointer? Maybe "caname" could be allocated along with the struct too? The pointer field would have to remain, but at least we'd save on malloc overhead (we will have fewer allocations). (Changing "caname" into a 32-bit offset relative to the allocation start would save another 4 bytes on 64-bit systems, but it adds complexity. So I am merely suggesting re-pointing the pointer at this stage.) Thanks, Alexander P.S. For those reading the list archives, our current tar patches are found at http://cvsweb.openwall.com/cgi/cvsweb.cgi/Owl/packages/tar/
This patch corrects highly non-optimal memory allocation by canonicalize_filename_mode(), which got exposed with: 2009-08-07 Sergey Poznyakoff <[email protected]> [...] * src/misc.c: Include canonicalize.h (zap_slashes, normalize_filename): New functions. On a specific test case (a tree with around 3500 sub-directories with a total of around 58,000 files), this reduces tar's memory usage from 32 MB to 4.5 MB for the initial full backup run ("-g" enabled), and from 19 MB to 5.5 MB for a subsequent incremental run. On a real-world system with around 370,000 directories and 2.3 million files where the problem was first spotted, an incremental run of a 32-bit build of tar 1.22.90 bumped into the 3 GB process address space limit and failed, whereas a build with this patch applied uses around 400 MB during incremental runs and around 300 MB during initial full backup runs. --- tar-1.22.90/gnu/canonicalize.c.orig 2009-08-07 11:55:47 +0000 +++ tar-1.22.90/gnu/canonicalize.c 2009-12-08 15:50:40 +0000 @@ -161,6 +161,7 @@ canonicalize_filename_mode (const char * char const *end; char const *rname_limit; size_t extra_len = 0; + size_t actual_size; Hash_table *ht = NULL; if (name == NULL) @@ -325,6 +326,10 @@ canonicalize_filename_mode (const char * --dest; *dest = '\0'; + actual_size = strlen(rname) + 1; + if (rname_limit - rname > actual_size) + rname = xrealloc (rname, actual_size); + free (extra_buf); if (ht) hash_free (ht);
