On 09/12/2014 04:42 AM, [email protected] wrote:
> This post has made it to Hacker News[1].
>
> We have discussed optimization possibilities a bit, and I made the
> suggestion to replace the usage of a hash table in cp with sorting a
> list.
>
> For example: walk the source tree and write a list of ino/dev/path to
> a temporary file, then sort that file according to ino/dev (e.g. using
> GNU sort, which I seem to remember is already using a memory-efficient
> algorithm (i.e. works well with files much bigger than RAM)?), then
> parse the file back and copy the first path of every group with the
> same ino/dev and link the rest.
>
> The assumption is that sorting a list requires much less RAM to be
> efficient than the hash table. (I can't find my copy of TAOCP right
> now, I think it describes solutions.)
>
> Christian.
>
> [1] https://news.ycombinator.com/item?id=8305283
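For reference, the proposed pipeline could be sketched roughly like this with standard GNU tools (a toy illustration only, not cp's implementation; the demo tree and the "copy"/"link" plan output are my own invention here, and a real version would need NUL-terminated records, e.g. find -printf '...\0' with sort -z, to cope with whitespace in paths):

```shell
# Build a tiny demo tree containing one hardlink pair:
demo=$(mktemp -d)
mkdir "$demo/src"
echo a > "$demo/src/a"
echo b > "$demo/src/b"
ln "$demo/src/a" "$demo/src/a2"     # a and a2 share a dev/ino pair

# 1) emit "device inode path" per file (GNU find %D/%i/%p),
# 2) sort so entries with the same dev/ino become adjacent,
# 3) "copy" the first path of each dev/ino group and "link" the rest.
plan=$(find "$demo/src" -type f -printf '%D %i %p\n' \
  | sort -k1,1n -k2,2n \
  | awk '{ key = $1 " " $2
           if (key == prev) print "link", first, $3
           else { print "copy", $3; first = $3; prev = key } }')
printf '%s\n' "$plan"
```

For the demo tree above this emits two "copy" lines and one "link" line; only step 2 (the external sort) ever needs to see the full set at once.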
The main problem here is that, at some stage, you need the full set somewhere in order to identify the hardlinks within it. Using an external sorted list would result in larger disk usage (than having cp swap out its memory), but would probably benefit from better memory locality. I.e. while more expensive CPU-wise, sort(1) would use a divide-and-conquer approach to achieve better memory locality, whereas cp updates its hash as it copies, which is the best approach assuming no swapping, but which means essentially random parts of memory (swap) are updated as it runs. On the 10G RAM system that resulted in around 30/420 inodes/direntries being copied per second on average over the 10-day run.

It's worth mentioning that mechanical disks, with their orders-of-magnitude slower random access latencies, are going away, so we must consider that, along with the frequency of this use case, when making any changes to the complexity of the code.

If we were to address this, the two approaches would be to:

  1. reduce the size of the hash
  2. employ locality-sensitive hashing

Looking at 1, I see there was an attempt to reduce the size of the hash by excluding traversed source directories:
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=3ece035

Now Rasmus' analysis suggests that this exclusion didn't happen for him, resulting in a 350% memory increase in his case, which would have been more than enough to move the working set from RAM to swap. The patch above didn't cater for the fact that dirs generally have st_nlink > 1, which is fixed up in the following patch.

Note there is a caveat: this will now not detect "hardlinked dirs" as reported (intermittently?) on netapp file systems, and thus will not exit with an error, instead just creating new directories in that case. Perhaps that's what we should be doing anyway?
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/copy.c;h=a9561c6

diff --git a/src/copy.c b/src/copy.c
index a9561c6..a2e297f 100644
--- a/src/copy.c
+++ b/src/copy.c
@@ -2152,7 +2152,7 @@ copy_internal (char const *src_name, char const *dst_name,
         }
       else if (x->preserve_links
                && !x->hard_link
-               && (1 < src_sb.st_nlink
+               && ((1 < src_sb.st_nlink && ! S_ISDIR (src_mode))
                    || (command_line_arg
                        && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS)
                    || x->dereference == DEREF_ALWAYS))

thanks,
Pádraig.
