On 09/12/2014 03:11 PM, Pádraig Brady wrote:
> On 09/12/2014 04:42 AM, [email protected] wrote:
>> This post has made it to Hacker News[1].
>>
>> We have discussed optimization possibilities a bit, and I made the
>> suggestion to replace the usage of a hash table in cp with sorting a
>> list.
>>
>> For example: walk the source tree and write a list of ino/dev/path to
>> a temporary file, then sort that file according to ino/dev (e.g. using
>> GNU sort, which I seem to remember is already using a memory-efficient
>> algorithm, i.e. works well with files much bigger than RAM?), then
>> parse the file back and copy the first path of every group with the
>> same ino/dev and link the rest.
>>
>> The assumption is that sorting a list requires much less RAM to be
>> efficient than the hash table. (I can't find my copy of TAOCP right
>> now, I think it describes solutions.)
>>
>> Christian.
>>
>> [1] https://news.ycombinator.com/item?id=8305283
>
> The main problem here is that you need the full set somewhere
> at some stage to identify the hardlinks among the full set.
>
> Using an external sorted list would result in larger disk usage
> (than having cp swap out its mem), but probably would benefit
> from better mem locality. I.e. while more expensive CPU wise,
> sort(1) would use a divide and conquer approach to achieve better
> memory locality, while cp updates its hash as it copies, which is the
> best approach assuming no swapping, but that results in essentially
> random parts of mem (swap) being updated as it runs.
>
> On the 10G RAM system that resulted in around 30/420 inodes/direntries
> being copied per second on average over the 10 day run.
>
> It's worth mentioning that mechanical disks with their orders of magnitude
> slower random access latencies are going away, so we must consider that
> along with the frequency of this use case when making any changes to
> the complexity of the code.
>
> If we were to address this, the two approaches would be to
> 1. reduce the size of the hash
> 2. employ locality-sensitive hashing
>
> Looking at 1, I see there was an attempt to reduce the size
> of the hash by excluding traversed source directories:
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=3ece035
>
> Now Rasmus' analysis suggests that this didn't happen for him,
> resulting in a 350% mem increase for his case, which would have been
> more than enough to move the working set from RAM to swap.
>
> The patch above didn't cater for the fact that dirs
> generally have st_nlink > 1, which is fixed up in the following patch:
>
> Note there is a caveat where this will now not detect "hardlinked dirs"
> as reported (intermittently?) on netapp file systems, and thus not
> exit with an error, and instead just create new directories in that case.
> Perhaps that's what we should be doing anyway?
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/copy.c;h=a9561c6
>
> diff --git a/src/copy.c b/src/copy.c
> index a9561c6..a2e297f 100644
> --- a/src/copy.c
> +++ b/src/copy.c
> @@ -2152,7 +2152,7 @@ copy_internal (char const *src_name, char const *dst_name,
>         }
>       else if (x->preserve_links
>                && !x->hard_link
> -              && (1 < src_sb.st_nlink
> +              && ((1 < src_sb.st_nlink && ! S_ISDIR (src_mode))
>                    || (command_line_arg
>                        && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS)
>                    || x->dereference == DEREF_ALWAYS))
>
> thanks,
> Pádraig.

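For reference, the quoted sort-based suggestion boils down to something like
the sketch below. This is illustrative only: the record layout, fixed-size
path buffer and printf output are placeholders, and a real version would
append the records to a temporary file during the tree walk and sort it
externally (e.g. with sort(1)) rather than with an in-memory qsort.

/* Sketch of the proposed approach: collect (dev, ino, path) records,
   sort by (dev, ino), then copy the first member of each group and
   hardlink the rest.  Illustrative only.  */

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

struct rec
{
  dev_t dev;
  ino_t ino;
  char path[256];               /* fixed size only for the sketch */
};

static int
cmp_dev_ino (void const *va, void const *vb)
{
  struct rec const *a = va;
  struct rec const *b = vb;
  if (a->dev != b->dev)
    return a->dev < b->dev ? -1 : 1;
  if (a->ino != b->ino)
    return a->ino < b->ino ? -1 : 1;
  return 0;
}

int
main (void)
{
  /* Made-up records as they might be emitted by the tree walk.  */
  struct rec recs[] =
    {
      { 1, 42, "src/a" },
      { 1, 7,  "src/b" },
      { 1, 42, "src/c" },       /* hardlink to src/a */
    };
  size_t n = sizeof recs / sizeof *recs;

  qsort (recs, n, sizeof *recs, cmp_dev_ino);

  /* After sorting, members of a hardlink group are adjacent, so one
     sequential pass can copy the first path and link the others.  */
  for (size_t i = 0; i < n; )
    {
      size_t j = i + 1;
      while (j < n && cmp_dev_ino (&recs[i], &recs[j]) == 0)
        j++;
      printf ("copy %s\n", recs[i].path);
      for (size_t k = i + 1; k < j; k++)
        printf ("link %s -> %s\n", recs[k].path, recs[i].path);
      i = j;
    }
  return 0;
}

The trade-off is the one described above: the sequential passes over a
sorted file have good locality, at the cost of temporary disk space and
the extra sort.
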
BTW there is a lot of redundancy in all the stored paths, which could be
reduced by using a prefix-compression method; a rough sketch of one possible
encoding is appended below.

Pádraig.
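
To illustrate the idea (nothing like this exists in copy.c; the example
paths and the notion of a small length field are just placeholders), front
coding a list of paths would store, for each entry, the number of leading
bytes it shares with the previous entry plus only the differing suffix:

/* Front coding (prefix compression) sketch: each entry is emitted as
   (shared-prefix-length, suffix).  Illustrative only.  */

#include <stdio.h>
#include <string.h>

static size_t
common_prefix (char const *a, char const *b)
{
  size_t n = 0;
  while (a[n] && a[n] == b[n])
    n++;
  return n;
}

int
main (void)
{
  /* Paths under a common tree share long prefixes, especially when
     adjacent entries come from the same directory.  */
  char const *paths[] =
    {
      "/srv/data/project/a/file1",
      "/srv/data/project/a/file2",
      "/srv/data/project/b/file1",
    };
  char const *prev = "";
  size_t raw = 0, coded = 0;

  for (size_t i = 0; i < sizeof paths / sizeof *paths; i++)
    {
      size_t shared = common_prefix (prev, paths[i]);
      char const *suffix = paths[i] + shared;
      printf ("(%zu, \"%s\")\n", shared, suffix);
      raw += strlen (paths[i]);
      coded += strlen (suffix) + 2;     /* suffix + small length field */
      prev = paths[i];
    }
  printf ("raw bytes: %zu, front-coded bytes: %zu\n", raw, coded);
  return 0;
}

The win obviously depends on how the paths are ordered; traversal order
keeps siblings adjacent, so the shared prefixes tend to be long.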
