On 09/12/2014 04:42 AM, [email protected] wrote:
> This post has made it to Hacker News[1].
>
> We have discussed optimization possibilities a bit, and I made the
> suggestion to replace the usage of a hash table in cp with sorting a
> list.
>
> For example: walk the source tree and write a list of ino/dev/path to
> a temporary file, then sort that file according to ino/dev (e.g. using
> GNU sort, which I seem to remember is already using a memory-efficient
> algorithm (i.e. works well with files much bigger than RAM)?), then
> parse the file back and copy the first path of every group with the
> same ino/dev and link the rest.
>
> The assumption is that sorting a list requires much less RAM to be
> efficient than the hash table. (I can't find my copy of TAOCP right
> now, I think it describes solutions.)
>
> Christian.
>
> [1] https://news.ycombinator.com/item?id=8305283
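For reference, the proposed pipeline could be sketched roughly like this with standard GNU tools (a toy illustration only, not cp's implementation; the demo tree and the "copy"/"link" plan output are my own invention here, and a real version would need NUL-terminated records, e.g. find -printf '...\0' with sort -z, to cope with whitespace in paths):

```shell
# Build a tiny demo tree containing one hardlink pair:
demo=$(mktemp -d)
mkdir "$demo/src"
echo a > "$demo/src/a"
echo b > "$demo/src/b"
ln "$demo/src/a" "$demo/src/a2"     # a and a2 share a dev/ino pair

# 1) emit "device inode path" per file (GNU find %D/%i/%p),
# 2) sort so entries with the same dev/ino become adjacent,
# 3) "copy" the first path of each dev/ino group and "link" the rest.
plan=$(find "$demo/src" -type f -printf '%D %i %p\n' \
  | sort -k1,1n -k2,2n \
  | awk '{ key = $1 " " $2
           if (key == prev) print "link", first, $3
           else { print "copy", $3; first = $3; prev = key } }')
printf '%s\n' "$plan"
```

For the demo tree above this emits two "copy" lines and one "link" line; only step 2 (the external sort) ever needs to see the full set at once.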
The main problem here is that, at some stage, you need the full set somewhere in order to identify the hardlinks within it. Using an external sorted list would result in larger disk usage (than having cp swap out its memory), but would probably benefit from better memory locality. I.e. while more expensive CPU-wise, sort(1) would use a divide-and-conquer approach to achieve better memory locality, whereas cp updates its hash as it copies, which is the best approach assuming no swapping, but which means essentially random parts of memory (swap) are updated as it runs. On the 10G RAM system that resulted in around 30/420 inodes/direntries being copied per second on average over the 10-day run.

It's worth mentioning that mechanical disks, with their orders-of-magnitude slower random access latencies, are going away, so we must consider that, along with the frequency of this use case, when making any changes to the complexity of the code.

If we were to address this, the two approaches would be to:

  1. reduce the size of the hash
  2. employ locality-sensitive hashing

Looking at 1, I see there was an attempt to reduce the size of the hash by excluding traversed source directories:
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=3ece035

Now Rasmus' analysis suggests that this exclusion didn't happen for him, resulting in a 350% memory increase in his case, which would have been more than enough to move the working set from RAM to swap. The patch above didn't cater for the fact that dirs generally have st_nlink > 1, which is fixed up in the following patch.

Note there is a caveat: this will now not detect "hardlinked dirs" as reported (intermittently?) on netapp file systems, and thus will not exit with an error, instead just creating new directories in that case. Perhaps that's what we should be doing anyway?
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/copy.c;h=a9561c6

diff --git a/src/copy.c b/src/copy.c
index a9561c6..a2e297f 100644
--- a/src/copy.c
+++ b/src/copy.c
@@ -2152,7 +2152,7 @@ copy_internal (char const *src_name, char const *dst_name,
         }
       else if (x->preserve_links
                && !x->hard_link
-               && (1 < src_sb.st_nlink
+               && ((1 < src_sb.st_nlink && ! S_ISDIR (src_mode))
                    || (command_line_arg
                        && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS)
                    || x->dereference == DEREF_ALWAYS))

thanks,
Pádraig.
