Hi Martin,

On Fri, 2 Feb 2007 12:34:35 +0100, Martin Quinson wrote:
> In attachment, a script version of the backup-files program. This has
> been used in the Debian package for several months, so it's tested and
> should be solid.
>
> We did so because quilt is nowadays quite deeply involved in Debian's
> build dependencies, and making the package architecture-independent
> simplified them quite a lot.
>
> I have a lot of other patches in the Debian package, and I'm willing to
> see them integrated upstream, but this one is the most important.
This is very interesting. I have always considered it nonsense to have one
part of quilt written in C, and I would be pleased to get rid of it. Of
course this will come at the price of some performance penalty, as a shell
script will never be as fast as a compiled C program, but if the penalty is
reasonable, we should consider it.

Unfortunately, your implementation doesn't quite fit within "reasonable"
limits. I made performance measurements on the four affected commands
(pop, add, remove and snapshot), and here is the degradation I observed:

    quilt pop:      3.1 times slower
    quilt add:      1.5 times slower
    quilt remove:   1.4 times slower
    quilt snapshot: 5.3 times slower

As a side note, I really wonder why "quilt pop" needs backup-files while
"quilt push" doesn't.

These penalties are quite high, and I am especially worried about pop and
add, as these are fundamental commands a quilt user runs very often. So I
made a complete review of your script to find out where we could improve
the situation. The good news is that I found quite a lot of things that
can be changed, which should restore some of the lost performance. I
tested again after implementing all my proposals, and the penalties were
down to:

    quilt pop:      2.4 times slower
    quilt add:      1.2 times slower
    quilt remove:   1.2 times slower
    quilt snapshot: 2.2 times slower

I don't think we can hope for better than 1.2. The penalties are still
significant for pop and snapshot, although much better than the original
numbers; hopefully someone will find out how this can be improved further.
On the other hand, maybe some of my cleanups were not correct. I'm posting
my review now so that others get a chance to comment.

Note: many comments apply to several places; I only comment on the first
occurrence each time.

> #! @BASH@
>
> set -e

I don't see the benefit of setting this.
> # File: backup-files.sh
> #
> # Copyright (C) 2006 Steve Langasek <[EMAIL PROTECTED]>
> # portions Copyright (C) 2003, 2004, 2005, 2006 Andreas Gruenbacher
> # <[EMAIL PROTECTED]>, SuSE Labs
> #
> # This program is free software; you can redistribute it and/or modify
> # it under the terms of the GNU General Public License as published by
> # the Free Software Foundation; version 2 dated June, 1991.
> #
> # This program is distributed in the hope that it will be useful, but
> # WITHOUT ANY WARRANTY; without even the implied warranty of
> # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> # General Public License for more details.
> #
> # You should have received a copy of the GNU General Public License
> # along with this program; if not, write to the Free Software
> # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston,
> # MA 02110-1301, USA.
>
> # Create backup files of a list of files similar to GNU patch. A path
> # name prefix and suffix for the backup file can be specified with the
> # -B and -z options.
>
> usage () {
>     progname="$1"
>     printf "Usage: %s [-B prefix] [-z suffix] [-f {file|-}] [-s] [-b|-r|-x] [-t] [-L] {file|-} ...\n"\
> "\n"\
> "\tCreate hard linked backup copies of a list of files\n"\
> "\tread from standard input.\n"\
> "\n"\
> "\t-b\tCreate backup\n"\
> "\t-r\tRestore the backup\n"\
> "\t-x\tRemove backup files and empty parent directories\n"\
> "\t-B\tPath name prefix for backup files\n"\
> "\t-z\tPath name suffix for backup files\n"\
> "\t-s\tSilent operation; only print error messages\n"\
> "\t-f\tRead the filenames to process from file (- = standard input)\n"\
> "\t-t\tTouch original files after restore (update their mtimes)\n\n"\
> "\t-L\tEnsure that when finished, the source file has a link count of 1\n\n" \
> "$progname"
> }

The option -z can be dropped, as we don't use it. This allows some code
cleanups.

> link_or_copy_file() {
>     from="$1"
>     to="$2"
>     if ! ln "$from" "$to" || [ -d "$to" ]; then

I don't understand the -d test here; it looks bogus.

>         cp -dp --remove-destination "$from" "$to"

I wonder how portable the --remove-destination option is. I don't think
it is needed anyway, as you always explicitly delete the destination file
before calling this function.

>     fi
> }
>
> ensure_nolinks() {
>     filename="$1"

All your variables inside functions should be declared as local.

>     link_count=$(stat -c '%h' "$filename")
>     if [ -z "$link_count" ] || [ "$link_count" -gt 1 ]; then

These double test constructs are better implemented as a single test with
-o (or -a) between the conditions.

>         dirname=$(dirname "$filename")
>         basename=$(basename "$filename")

Calling dirname and basename has a high cost. Using the bash prefix/suffix
removal constructs is two orders of magnitude faster:

    dirname="${filename%/*}"
    basename="${filename##*/}"

Believe it or not, this was one of the biggest performance improvements of
all the changes I made.

>         # Temp file name is "path/to/.file.XXXXXX"
>         tmpname=$(mktemp "${dirname}/.${basename}.XXXXXX")
>         cp -dp "$filename" "$tmpname"
>         mv "$tmpname" "$filename"
>     fi
> }
>
> process_file() {
>     file="$1"
>     backup="${OPT_PREFIX}${file}${OPT_SUFFIX}"
>
>     if [ "$OPT_WHAT" == "backup" ]; then

Using strings as switch keys isn't very efficient; it forces bash to make
string comparisons each time. It is especially bad here as two of the
values happen to begin with the same two letters. Using single (distinct)
letters should help performance.

>         if [ -e "$backup" ]; then
>             rm "$backup"

You should always call "rm -f" instead of "rm" in scripts.

>         else
>             mkdir -p "$(dirname "$backup")"
>         fi
>         if [ ! -e "$file" ]; then
>             $ECHO "New file $file"
>             touch "$backup"
>             chmod 000 "$backup"
>         else
>             $ECHO "Copying $file"
>             link_or_copy_file "$file" "$backup"
>             if [ -n "$OPT_NOLINKS" ]; then
>                 ensure_nolinks "$file"
>             fi

This construct doesn't look smart to me.
First you try to create a link to the file, and right after that you may
need to ensure that said file has no links! This operation is quite
costly, as you need to create a temporary file each time. I propose the
following variant:

    if [ -n "$OPT_NOLINKS" ]; then
        cp -dp "$file" "$backup"
        ensure_nolinks "$file"
    else
        link_or_copy_file "$file" "$backup"
    fi

This change gives a big performance boost.

>             if [ -n "OPT_TOUCH" ]; then
>                 touch "$backup"
>             fi

This option isn't supposed to be implemented for the backup function, so
I think we can remove it.

>         fi
>     elif [ "$OPT_WHAT" == "restore" ]; then
>         mkdir -p "$(dirname "$file")"
>
>         if [ ! -e "$backup" ]; then
>             return 1
>         fi
>         if [ ! -s "$backup" ]; then
>             if [ -e "$file" ]; then
>                 rm "$file"
>             fi

This test isn't really needed. In the vast majority of cases, the file
exists. If it doesn't, "rm -f" will still succeed.

>             $ECHO "Removing $file"
>             rm "$backup"

You can merge this rm command with the previous one, saving a call.

>             while [ -d "${backup%/*}" ] && ! [ -L "${backup%/*}" ]
>             do
>                 backup="${backup%/*}"
>                 rmdir --ignore-fail-on-non-empty "$backup" 2>/dev/null
>                 if [ -d "$backup" ]; then
>                     break
>                 fi
>             done

Wow. Isn't this an exact reimplementation of "rmdir -p"? We could also
drop the --ignore-fail-on-non-empty option (which might not be very
portable) if we removed the "set -e" at the beginning of the script.

>         else
>             $ECHO "Restoring $file"
>             if [ -e "$file" ]; then
>                 rm "$file"
>             fi
>             link_or_copy_file "$backup" "$file"
>             rm "$backup"

This doesn't look good. First you remove the file, then you create a link
or copy from the backup to the file, and last you remove the backup file.
Can't we just use "mv -f" here? It gives a huge performance boost.

>             while [ -d "${backup%/*}" ] && ! [ -L "${backup%/*}" ]
>             do
>                 backup="${backup%/*}"
>                 rmdir --ignore-fail-on-non-empty "$backup" 2>/dev/null
>                 if [ -d "$backup" ]; then
>                     break
>                 fi
>             done
>             if [ -n "$OPT_NOLINKS" ]; then
>                 ensure_nolinks "$file"
>             fi
>             if [ -n "$OPT_TOUCH" ]; then
>                 touch "$file"
>             fi
>         fi
>     elif [ "$OPT_WHAT" == "remove" ]; then
>         if [ -e "$backup" ]; then
>             rm "$backup"
>         fi
>         while [ -d "${backup%/*}" ] && ! [ -L "${backup%/*}" ]
>         do
>             backup="${backup%/*}"
>             rmdir --ignore-fail-on-non-empty "$backup" 2>/dev/null
>             if [ -d "$backup" ]; then
>                 break
>             fi
>         done
>     elif [ "$OPT_WHAT" == "noop" ]; then

You have a bug here: OPT_WHAT is never set to "noop", so this test can
never succeed. I guess you want [ -z "$OPT_WHAT" ] here instead.

>         if [ -e "$file" ] && [ -n "$OPT_NOLINKS" ]; then
>             ensure_nolinks "$file"
>         fi
>     else
>         return 1

This can't really happen, so why bother?

>     fi
> }
>
> walk() {
>     path="$1"
>     if [ ! -f "$path" ]; then
>         return 0
>     fi

I don't think this can happen, as you call this function on the files
"find" found.

>     if [ "${path#$OPT_PREFIX}" == "$path" ]
>     then
>         # prefix does not match
>         return 0
>     fi
>     path="${path#$OPT_PREFIX}"

You are doing the same prefix stripping twice; this could be avoided.

>     if [ -n "$OPT_SUFFIX" ] && [ "${path%$OPT_SUFFIX}" == "$path" ]
>     then
>         # suffix does not match
>         return 0
>     fi
>     path="${path%$OPT_SUFFIX}"
>
>     process_file "$path"
> }
>
> ECHO=echo
> declare -a FILELIST
> progname="$0"

$0 is a global, and you only need it once, so I don't think it's valuable
to define another variable and pass it around to functions.
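To double-check the dirname/basename replacement suggested earlier, here
is a minimal self-contained sketch (the path is made up). Note the one
edge case where the expansions and the external tools disagree: a path
with no slash at all, which backup-files never passes in:

```shell
#!/bin/sh
# Pure shell replacements for dirname(1)/basename(1), as proposed above.
# No fork, no exec: that is where the speedup comes from.
filename="patches/dir/.pc/foo.patch/file.c"

dirname_fast="${filename%/*}"    # strip shortest trailing /component
basename_fast="${filename##*/}"  # strip longest leading path/

echo "dir:  $dirname_fast"
echo "base: $basename_fast"

# Caveat: for a bare "file.c" (no slash), ${filename%/*} would yield
# "file.c" unchanged, while dirname(1) would yield "." - acceptable
# here, since the script only handles paths containing a slash.
```

This prints "dir:  patches/dir/.pc/foo.patch" and "base: file.c".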
> while [ $# -gt 0 ]; do
>     case $1 in
>     -b) OPT_WHAT=backup
>         ;;
>     -r) OPT_WHAT=restore
>         ;;
>     -x) OPT_WHAT=remove
>         ;;
>     -B) OPT_PREFIX=$2
>         shift
>         ;;
>     -f) OPT_FILE=$2
>         shift
>         ;;
>     -z) OPT_SUFFIX=$2
>         shift
>         ;;
>     -s) ECHO=:
>         ;;
>     -L) OPT_NOLINKS=1
>         ;;
>     -t) OPT_TOUCH=1
>         ;;
>     -?*) usage "$progname"
>         exit 0
>         ;;
>     *) FILELIST=($1)
>         ;;
>     esac
>
>     shift
> done
>
> if [ -z "${OPT_PREFIX}${OPT_SUFFIX}" ]; then
>     usage "$progname"
>     exit 1
> fi
> if [ ${#FILELIST[@]} == 0 ] && [ -z "$OPT_FILE" ]; then
>     usage "$progname"
>     exit 1
> fi

Both tests can be merged.

> if [ -n "$OPT_FILE" ]; then
>     cat "$OPT_FILE" \
>     | while read nextfile; do
>         process_file "$nextfile"
>     done
> fi
>
> I=0
> while [ $I -lt ${#FILELIST[@]} ]; do
>
>     case "${FILELIST[$I]}" in
>     -)
>         path="${OPT_PREFIX%/*}"
>         if ! [ -n "$path" ] && [ -d "$path" ] ; then

This test looks very bogus to me. I don't think it can succeed at all.
What was the original intent?

>             I=$(($I+1))
>             continue
>         fi
>
>         find "$path" -mindepth 1 \( -type f -o -type d \) -print 2>/dev/null \
>         | while read
>         do
>             if [ -d "$REPLY" ]
>             then
>                 if ! [ -r "$REPLY" ] || ! [ -x "$REPLY" ]
>                 then
>                     echo "$REPLY: Permission denied"
>                     exit 1
>                 fi
>             else
>                 walk "$REPLY"
>             fi
>         done

You ask "find" to list both files and directories, then for each directory
you make additional tests, while you don't actually do anything useful
with these directories. I guess the idea is to anticipate errors, but
overall it's quite expensive, as you end up calling stat once for each
file and three times for each directory. All this to handle an error
which is highly unlikely to happen in quilt, as users aren't supposed to
mess with the .pc directory contents. So I'd rather let find print errors
by itself if needed, and only have it list the regular files we are
interested in.

>         if [ $? != 0 ]; then
>             exit 1
>         fi
>         ;;
>     *)
>         process_file "${FILELIST[$I]}"
>         ;;
>     esac
>
>     I=$(($I+1))
> done

I am interested in comments on my suggestions, and even more in other
suggestions to make the script faster. One thing we could do is convert
this external script into a sourceable one (a la patchfns); this would
let us make further cleanups.

Martin, care to update your patch based on the comments you agree with?

Thanks,
-- 
Jean Delvare

_______________________________________________
Quilt-dev mailing list
Quilt-dev@nongnu.org
http://lists.nongnu.org/mailman/listinfo/quilt-dev
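P.S.: To make the find simplification proposed above concrete, here is a
rough sketch of the loop I have in mind. Everything here is illustrative:
walk() is reduced to a counting stub, and the tree lives in a throwaway
temp directory.

```shell
#!/bin/sh
# Sketch: have find list only the regular files and hand each one to the
# per-file handler, instead of also listing directories and stat'ing
# them ourselves.
dir=$(mktemp -d)
mkdir -p "$dir/.pc/patch/sub"
touch "$dir/.pc/patch/a" "$dir/.pc/patch/sub/b"

seen=0
walk() {
    seen=$((seen + 1))   # the real script would process "$1" here
}

# A temp file rather than a pipe, so walk() runs in the current shell
# (a pipe would put the loop in a subshell and lose its state).
find "$dir/.pc/patch" -mindepth 1 -type f -print > "$dir/files"
while IFS= read -r f; do
    walk "$f"
done < "$dir/files"

rm -rf "$dir"
echo "processed $seen files"
```

With the two test files created above, this prints "processed 2 files".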