On 09/12/2014 03:11 PM, Pádraig Brady wrote: > On 09/12/2014 04:42 AM, [email protected] wrote: >> This post has made it to Hacker News[1]. >> >> We have discussed optimization possibilities a bit, and I made the >> suggestion to replace the usage of a hash table in cp with sorting a >> list. >> >> For example: walk the source tree and write a list of ino/dev/path to >> a temporary file, then sort that file according to ino/dev (e.g. using >> GNU sort, which I seem to remember is already using a memory-efficient >> algorithm (i.e. works well with files much bigger than RAM)?), then >> parse the file back and copy the first path of every group with the >> same ino/dev and link the rest. >> >> The assumption is that sorting a list requires much less RAM to be >> efficient than the hash table. (I can't find my copy of TAOCP right >> now, I think it describes solutions.) >> >> Christian. >> >> [1] https://news.ycombinator.com/item?id=8305283 > > The main problem here is that you need the full set somewhere > at some stage to identify the hardlinks among the full set. > > Using an external sorted list would result in a larger disk usage > (than having cp swap out its mem), but probably would benefit > from better mem locality. I.E. while more expensive CPU wise, > sort(1) would use a divide and conquer approach to achieve better > memory locality, while cp updates its hash as it copies, which is the > best approach assuming no swapping, but that results in essentially > random parts of mem (swap) being updated as it runs. > > That resulted on the 10G RAM system in around 30/420 inodes/direntries > being copied per second on average over the 10 day run. > > It's worth mentioning that mechanical disks with their orders of magnitude > slower random access latencies are going away, so we must consider that > along with the frequency of this use case when making any changes to > the complexity of the code. > > If we were to address this, the two approaches would be to > 1. reduce the size of the hash > 2. employ locality-sensitive hashing > > Looking at 1, I see there was an attempt to reduce the size > of the hash by excluding traversed source directories: > http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=3ece035 > > Now Rasmus' analysis suggests that this didn't happen for him, > resulting in a 350% mem increase for his case, which would have been > more than enough to move the working set from RAM to swap. > > The patch above didn't cater for the fact that dirs > generally have st_nlink > 1, which is fixed up in the following patch: > > Note there is a caveat where this will now not detect "hardlinked dirs" > as reported (intermittently?) on netapp file systems, and thus not > exit with an error, and instead just create new directories in that case. > Perhaps that's what we should be doing anyway? > http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/copy.c;h=a9561c6#l2173 > > diff --git a/src/copy.c b/src/copy.c > index a9561c6..a2e297f 100644 > --- a/src/copy.c > +++ b/src/copy.c > @@ -2152,7 +2152,7 @@ copy_internal (char const *src_name, char const > *dst_name, > } > else if (x->preserve_links > && !x->hard_link > - && (1 < src_sb.st_nlink > + && ((1 < src_sb.st_nlink && ! S_ISDIR (src_mode)) > || (command_line_arg > && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS) > || x->dereference == DEREF_ALWAYS))
To get an idea of the impact of removing the dirs from the hash as per the above commit, I did a search for the corresponding "cp: will not create hardlink" error: https://www.google.com/search?q="cp:+will+not+create+hardlink" There were a few correct hits but also a few cases where the error should not have been emitted! The two patches attached fix some of these and change cp to behave as follows: $ mkdir d1 dest $ cp -a d1 d1 dest cp: warning: source directory ‘d1’ specified more than once $ ln -s d1 d_sl $ cp -aH d1 d_sl dest $ ls dest d1 d_sl As for the patch above, I'm still holding off due to wariness about dir hardlinks which OS X supports it seems (http://stackoverflow.com/q/80875/4421) and other cases I still need to check, like bind mounted dirs causing cycles. thanks, Pádraig.
>From f105bbf4a9648816de5af7e85a510e3415c8105e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]> Date: Wed, 17 Sep 2014 18:50:08 +0100 Subject: [PATCH 1/2] cp: issue correct warning and ignore duplicate source dirs * src/copy.c (copy_internal): Handle the case where we have the same destination directory as already encountered, which can only be due to the corresponding source directory being specified multiple times. * tests/cp/duplicate-sources.sh: Add a test for the new multiply specified directory case, and the existing multiply specified file case. * tests/local.mk: Reference the new test. * NEWS: Mention the bug fix. --- NEWS | 6 ++++++ src/copy.c | 11 +++++++++++ tests/cp/duplicate-sources.sh | 35 +++++++++++++++++++++++++++++++++++ tests/local.mk | 1 + 4 files changed, 53 insertions(+), 0 deletions(-) create mode 100755 tests/cp/duplicate-sources.sh diff --git a/NEWS b/NEWS index ba0d3d7..ede5fd1 100644 --- a/NEWS +++ b/NEWS @@ -2,6 +2,12 @@ GNU coreutils NEWS -*- outline -*- * Noteworthy changes in release ?.? (????-??-??) [?] +** Bug fixes + + cp no longer issues an incorrect warning about directory hardlinks when a + source directory is specified multiple times. Instead a warning is issued + and the duplicate source directory is skipped, as already the case for files. + ** New features chroot accepts the new --skip-chdir option to not change the working directory diff --git a/src/copy.c b/src/copy.c index a9561c6..4e0d266 100644 --- a/src/copy.c +++ b/src/copy.c @@ -2187,6 +2187,17 @@ copy_internal (char const *src_name, char const *dst_name, *copy_into_self = true; goto un_backup; } + else if (same_name (dst_name, earlier_file)) + { + error (0, 0, _("warning: source directory %s " + "specified more than once"), + quote (top_level_src_name)); + /* We only do backups in move mode and for non dirs, + and in move mode this won't be the issue as the source will + be missing for subsequent attempts. + There we just warn and return here. */ + return true; + } else if (x->dereference == DEREF_ALWAYS) { /* This happens when e.g., encountering a directory for the diff --git a/tests/cp/duplicate-sources.sh b/tests/cp/duplicate-sources.sh new file mode 100755 index 0000000..e15c6d2 --- /dev/null +++ b/tests/cp/duplicate-sources.sh @@ -0,0 +1,35 @@ +#!/bin/sh +# Ensure cp warns about but otherwise ignores source +# items specified multiple times. + +# Copyright (C) 2014 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. + +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. + +. "${srcdir=.}/tests/init.sh"; path_prepend_ ./src +print_ver_ cp + +mkdir a b || framework_failure_ +touch f || framework_failure_ + +cp -a a a f f b 2>err || fail=1 + +cat <<EOF >exp +cp: warning: source directory 'a' specified more than once +cp: warning: source file 'f' specified more than once +EOF + +compare exp err || fail=1 + +Exit $fail diff --git a/tests/local.mk b/tests/local.mk index 1edaaf4..97bf5ed 100644 --- a/tests/local.mk +++ b/tests/local.mk @@ -427,6 +427,7 @@ all_tests = \ tests/cp/dir-rm-dest.sh \ tests/cp/dir-slash.sh \ tests/cp/dir-vs-file.sh \ + tests/cp/duplicate-sources.sh \ tests/cp/existing-perm-dir.sh \ tests/cp/existing-perm-race.sh \ tests/cp/fail-perm.sh \ -- 1.7.7.6 >From 469988c772a38c0e44bd7503ed5b98c8db4abea9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]> Date: Thu, 18 Sep 2014 11:31:22 +0100 Subject: [PATCH 2/2] cp: fix handling of -H with multiply specified source dirs Following on from commit v5.92-729-g130dd06, also avoid the erroneous directory hardlink warning with -H. * src/copy.c (copy_internal): Also handle the -H case for command line arguments. * tests/cp/duplicate-sources.sh: Augment the test case. * NEWS: Augment the news entry. --- NEWS | 5 +++-- src/copy.c | 4 +++- tests/cp/duplicate-sources.sh | 13 +++++++++++-- 3 files changed, 17 insertions(+), 5 deletions(-) diff --git a/NEWS b/NEWS index ede5fd1..60117aa 100644 --- a/NEWS +++ b/NEWS @@ -5,8 +5,9 @@ GNU coreutils NEWS -*- outline -*- ** Bug fixes cp no longer issues an incorrect warning about directory hardlinks when a - source directory is specified multiple times. Instead a warning is issued - and the duplicate source directory is skipped, as already the case for files. + source directory is specified multiple times. Now, consistent with other + file types, a warning is issued for source directories with duplicate names, + or with -H the directory is copied again using the symlink name. ** New features diff --git a/src/copy.c b/src/copy.c index 4e0d266..b7baee4 100644 --- a/src/copy.c +++ b/src/copy.c @@ -2198,7 +2198,9 @@ copy_internal (char const *src_name, char const *dst_name, There we just warn and return here. */ return true; } - else if (x->dereference == DEREF_ALWAYS) + else if (x->dereference == DEREF_ALWAYS + || (command_line_arg + && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS)) { /* This happens when e.g., encountering a directory for the second or subsequent time via symlinks when cp is invoked diff --git a/tests/cp/duplicate-sources.sh b/tests/cp/duplicate-sources.sh index e15c6d2..35b2e46 100755 --- a/tests/cp/duplicate-sources.sh +++ b/tests/cp/duplicate-sources.sh @@ -20,10 +20,19 @@ . "${srcdir=.}/tests/init.sh"; path_prepend_ ./src print_ver_ cp -mkdir a b || framework_failure_ +mkdir a || framework_failure_ touch f || framework_failure_ -cp -a a a f f b 2>err || fail=1 +# verify multiple files and dir sources only warned about +mkdir dest || framework_failure_ +cp -a a a f f dest 2>err || fail=1 +rm -Rf dest || framework_failure_ + +# verify multiple dirs and files with different names copied +mkdir dest || framework_failure_ +ln -s a al || framework_failure_ +ln -s f fl || framework_failure_ +cp -aH a al f fl dest 2>>err || fail=1 cat <<EOF >exp cp: warning: source directory 'a' specified more than once -- 1.7.7.6
