On 09/12/2014 03:11 PM, Pádraig Brady wrote:
> On 09/12/2014 04:42 AM, [email protected] wrote:
>> This post has made it to Hacker News[1].
>>
>> We have discussed optimization possibilities a bit, and I made the
>> suggestion to replace the usage of a hash table in cp with sorting a
>> list.
>>
>> For example: walk the source tree and write a list of ino/dev/path to
>> a temporary file, then sort that file according to ino/dev (e.g. using
>> GNU sort, which I seem to remember is already using a memory-efficient
>> algorithm (i.e. works well with files much bigger than RAM)?), then
>> parse the file back and copy the first path of every group with the
>> same ino/dev and link the rest.
>>
>> The assumption is that sorting a list requires much less RAM to be
>> efficient than the hash table. (I can't find my copy of TAOCP right
>> now, I think it describes solutions.)
>>
>> Christian.
>>
>> [1] https://news.ycombinator.com/item?id=8305283
> 
> The main problem here is that you need the full set somewhere
> at some stage to identify the hardlinks among the full set.
> 
> Using an external sorted list would result in a larger disk usage
> (than having cp swap out its mem), but probably would benefit
> from better mem locality.  I.E. while more expensive CPU wise,
> sort(1) would use a divide and conquer approach to achieve better
> memory locality, while cp updates its hash as it copies, which is the
> best approach assuming no swapping, but that results in essentially
> random parts of mem (swap) being updated as it runs.
> 
> That resulted on the 10G RAM system in around 30/420 inodes/direntries
> being copied per second on average over the 10 day run.
> 
> It's worth mentioning that mechanical disks with their orders of magnitude
> slower random access latencies are going away, so we must consider that
> along with the frequency of this use case when making any changes to
> the complexity of the code.
> 
> If we were to address this, the two approaches would be to
> 1. reduce the size of the hash
> 2. employ locality-sensitive hashing
> 
> Looking at 1, I see there was an attempt to reduce the size
> of the hash by excluding traversed source directories:
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=3ece035
> 
> Now Rasmus' analysis suggests that this didn't happen for him,
> resulting in a 350% mem increase for his case, which would have been
> more than enough to move the working set from RAM to swap.
> 
> The patch above didn't cater for the fact that dirs
> generally have st_nlink > 1, which is fixed up in the following patch:
> 
> Note there is a caveat where this will now not detect "hardlinked dirs"
> as reported (intermittently?) on netapp file systems, and thus not
> exit with an error, and instead just create new directories in that case.
> Perhaps that's what we should be doing anyway?
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/copy.c;h=a9561c6#l2173
> 
> diff --git a/src/copy.c b/src/copy.c
> index a9561c6..a2e297f 100644
> --- a/src/copy.c
> +++ b/src/copy.c
> @@ -2152,7 +2152,7 @@ copy_internal (char const *src_name, char const 
> *dst_name,
>      }
>    else if (x->preserve_links
>             && !x->hard_link
> -           && (1 < src_sb.st_nlink
> +           && ((1 < src_sb.st_nlink && ! S_ISDIR (src_mode))
>                 || (command_line_arg
>                     && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS)
>                 || x->dereference == DEREF_ALWAYS))

To get an idea of the impact of removing the dirs from the hash as per the 
above commit,
I did a search for the corresponding "cp: will not create hardlink" error:
  https://www.google.com/search?q="cp:+will+not+create+hardlink";
There were a few correct hits but also a few cases where
the error should not have been emitted!

The two patches attached fix some of these and change cp to behave as follows:

  $ mkdir d1 dest
  $ cp -a d1 d1 dest
  cp: warning: source directory ‘d1’ specified more than once

  $ ln -s d1 d_sl
  $ cp -aH d1 d_sl dest
  $ ls dest
  d1  d_sl

As for the patch above, I'm still holding off due to wariness about
dir hardlinks which OS X supports it seems 
(http://stackoverflow.com/q/80875/4421)
and other cases I still need to check, like bind mounted dirs causing cycles.

thanks,
Pádraig.
>From f105bbf4a9648816de5af7e85a510e3415c8105e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Wed, 17 Sep 2014 18:50:08 +0100
Subject: [PATCH 1/2] cp: issue correct warning and ignore duplicate source
 dirs

* src/copy.c (copy_internal): Handle the case where we have the
same destination directory as already encountered, which can only
be due to the corresponding source directory being specified multiple
times.
* tests/cp/duplicate-sources.sh: Add a test for the new multiply
specified directory case, and the existing multiply specified file case.
* tests/local.mk: Reference the new test.
* NEWS: Mention the bug fix.
---
 NEWS                          |    6 ++++++
 src/copy.c                    |   11 +++++++++++
 tests/cp/duplicate-sources.sh |   35 +++++++++++++++++++++++++++++++++++
 tests/local.mk                |    1 +
 4 files changed, 53 insertions(+), 0 deletions(-)
 create mode 100755 tests/cp/duplicate-sources.sh

diff --git a/NEWS b/NEWS
index ba0d3d7..ede5fd1 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,12 @@ GNU coreutils NEWS                                    -*- outline -*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Bug fixes
+
+  cp no longer issues an incorrect warning about directory hardlinks when a
+  source directory is specified multiple times.  Instead a warning is issued
+  and the duplicate source directory is skipped, as already the case for files.
+
 ** New features
 
   chroot accepts the new --skip-chdir option to not change the working directory
diff --git a/src/copy.c b/src/copy.c
index a9561c6..4e0d266 100644
--- a/src/copy.c
+++ b/src/copy.c
@@ -2187,6 +2187,17 @@ copy_internal (char const *src_name, char const *dst_name,
               *copy_into_self = true;
               goto un_backup;
             }
+          else if (same_name (dst_name, earlier_file))
+            {
+              error (0, 0, _("warning: source directory %s "
+                             "specified more than once"),
+                     quote (top_level_src_name));
+              /* We only do backups in move mode and for non dirs,
+                 and in move mode this won't be the issue as the source will
+                 be missing for subsequent attempts.
+                 There we just warn and return here.  */
+              return true;
+            }
           else if (x->dereference == DEREF_ALWAYS)
             {
               /* This happens when e.g., encountering a directory for the
diff --git a/tests/cp/duplicate-sources.sh b/tests/cp/duplicate-sources.sh
new file mode 100755
index 0000000..e15c6d2
--- /dev/null
+++ b/tests/cp/duplicate-sources.sh
@@ -0,0 +1,35 @@
+#!/bin/sh
+# Ensure cp warns about but otherwise ignores source
+# items specified multiple times.
+
+# Copyright (C) 2014 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/tests/init.sh"; path_prepend_ ./src
+print_ver_ cp
+
+mkdir a b || framework_failure_
+touch f || framework_failure_
+
+cp -a a a f f b 2>err || fail=1
+
+cat <<EOF >exp
+cp: warning: source directory 'a' specified more than once
+cp: warning: source file 'f' specified more than once
+EOF
+
+compare exp err || fail=1
+
+Exit $fail
diff --git a/tests/local.mk b/tests/local.mk
index 1edaaf4..97bf5ed 100644
--- a/tests/local.mk
+++ b/tests/local.mk
@@ -427,6 +427,7 @@ all_tests =					\
   tests/cp/dir-rm-dest.sh			\
   tests/cp/dir-slash.sh				\
   tests/cp/dir-vs-file.sh			\
+  tests/cp/duplicate-sources.sh			\
   tests/cp/existing-perm-dir.sh			\
   tests/cp/existing-perm-race.sh		\
   tests/cp/fail-perm.sh				\
-- 
1.7.7.6


>From 469988c772a38c0e44bd7503ed5b98c8db4abea9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Thu, 18 Sep 2014 11:31:22 +0100
Subject: [PATCH 2/2] cp: fix handling of -H with multiply specified source
 dirs

Following on from commit v5.92-729-g130dd06, also avoid
the erroneous directory hardlink warning with -H.

* src/copy.c (copy_internal): Also handle the -H case
for command line arguments.
* tests/cp/duplicate-sources.sh: Augment the test case.
* NEWS: Augment the news entry.
---
 NEWS                          |    5 +++--
 src/copy.c                    |    4 +++-
 tests/cp/duplicate-sources.sh |   13 +++++++++++--
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/NEWS b/NEWS
index ede5fd1..60117aa 100644
--- a/NEWS
+++ b/NEWS
@@ -5,8 +5,9 @@ GNU coreutils NEWS                                    -*- outline -*-
 ** Bug fixes
 
   cp no longer issues an incorrect warning about directory hardlinks when a
-  source directory is specified multiple times.  Instead a warning is issued
-  and the duplicate source directory is skipped, as already the case for files.
+  source directory is specified multiple times.  Now, consistent with other
+  file types, a warning is issued for source directories with duplicate names,
+  or with -H the directory is copied again using the symlink name.
 
 ** New features
 
diff --git a/src/copy.c b/src/copy.c
index 4e0d266..b7baee4 100644
--- a/src/copy.c
+++ b/src/copy.c
@@ -2198,7 +2198,9 @@ copy_internal (char const *src_name, char const *dst_name,
                  There we just warn and return here.  */
               return true;
             }
-          else if (x->dereference == DEREF_ALWAYS)
+          else if (x->dereference == DEREF_ALWAYS
+                   || (command_line_arg
+                       && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS))
             {
               /* This happens when e.g., encountering a directory for the
                  second or subsequent time via symlinks when cp is invoked
diff --git a/tests/cp/duplicate-sources.sh b/tests/cp/duplicate-sources.sh
index e15c6d2..35b2e46 100755
--- a/tests/cp/duplicate-sources.sh
+++ b/tests/cp/duplicate-sources.sh
@@ -20,10 +20,19 @@
 . "${srcdir=.}/tests/init.sh"; path_prepend_ ./src
 print_ver_ cp
 
-mkdir a b || framework_failure_
+mkdir a || framework_failure_
 touch f || framework_failure_
 
-cp -a a a f f b 2>err || fail=1
+# verify multiple files and dir sources only warned about
+mkdir dest || framework_failure_
+cp -a a a f f dest 2>err || fail=1
+rm -Rf dest || framework_failure_
+
+# verify multiple dirs and files with different names copied
+mkdir dest || framework_failure_
+ln -s a al || framework_failure_
+ln -s f fl || framework_failure_
+cp -aH a al f fl dest 2>>err || fail=1
 
 cat <<EOF >exp
 cp: warning: source directory 'a' specified more than once
-- 
1.7.7.6

Reply via email to