Hi Martin,

On Fri, 2 Feb 2007 12:34:35 +0100, Martin Quinson wrote:
> In attachment, a script version of the backup-files program. This has
> been used in the Debian package for several months, so it's tested and
> should be solid.
>
> We did so because quilt is nowadays quite deeply involved in Debian's
> build dependencies, and making the package architecture-independent
> simplified them quite a lot.
>
> I have a lot of other patches in the Debian package, and I'm willing to
> see them integrated upstream, but this one is the most important.
This is very interesting. I have always considered it nonsense to have one
part of quilt written in C, and I would be pleased to get rid of it. Of
course this will come at the price of some performance penalty, as a shell
script will never be as fast as a compiled C program, but if the penalty is
reasonable, we should consider it.

Unfortunately, your implementation doesn't quite fit within "reasonable"
limits. I made performance measurements on the four affected commands
(pop, add, remove and snapshot), and here is the degradation I observed:

    quilt pop:      3.1 times slower
    quilt add:      1.5 times slower
    quilt remove:   1.4 times slower
    quilt snapshot: 5.3 times slower

As a side note, I really wonder why "quilt pop" needs backup-files while
"quilt push" doesn't.

These penalties are quite high, and I am especially worried about pop and
add, as these are fundamental commands a quilt user runs very often. So I
made a complete review of your script to find out where we could improve
the situation. The good news is that I found quite a lot of things that
can be changed, which should restore some of the lost performance. I
tested again after implementing all my proposals, and the penalties were
down to:

    quilt pop:      2.4 times slower
    quilt add:      1.2 times slower
    quilt remove:   1.2 times slower
    quilt snapshot: 2.2 times slower

I don't think we can hope for better than 1.2. The penalties are still
significant for pop and snapshot, although much better than the original
numbers; hopefully someone will find out how this can be improved further.
On the other hand, maybe some of my cleanups were not correct. I'm posting
my review now so that others get a chance to comment.

Note: many comments apply to several places; I only comment on the first
occurrence each time.

> #! @BASH@
>
> set -e

I don't see the benefit of setting this.
> # File: backup-files.sh
> #
> # Copyright (C) 2006 Steve Langasek <[EMAIL PROTECTED]>
> # portions Copyright (C) 2003, 2004, 2005, 2006 Andreas Gruenbacher
> # <[EMAIL PROTECTED]>, SuSE Labs
> #
> # This program is free software; you can redistribute it and/or modify
> # it under the terms of the GNU General Public License as published by
> # the Free Software Foundation; version 2 dated June, 1991.
> #
> # This program is distributed in the hope that it will be useful, but
> # WITHOUT ANY WARRANTY; without even the implied warranty of
> # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> # General Public License for more details.
> #
> # You should have received a copy of the GNU General Public License
> # along with this program; if not, write to the Free Software
> # Foundation, Inc., 51 Franklin St, Fifth Floor, Boston,
> # MA 02110-1301, USA.
>
> # Create backup files of a list of files similar to GNU patch. A path
> # name prefix and suffix for the backup file can be specified with the
> # -B and -z options.
>
> usage () {
>     progname="$1"
>     printf "Usage: %s [-B prefix] [-z suffix] [-f {file|-}] [-s] [-b|-r|-x] [-t] [-L] {file|-} ...\n"\
> "\n"\
> "\tCreate hard linked backup copies of a list of files\n"\
> "\tread from standard input.\n"\
> "\n"\
> "\t-b\tCreate backup\n"\
> "\t-r\tRestore the backup\n"\
> "\t-x\tRemove backup files and empty parent directories\n"\
> "\t-B\tPath name prefix for backup files\n"\
> "\t-z\tPath name suffix for backup files\n"\
> "\t-s\tSilent operation; only print error messages\n"\
> "\t-f\tRead the filenames to process from file (- = standard input)\n"\
> "\t-t\tTouch original files after restore (update their mtimes)\n\n"\
> "\t-L\tEnsure that when finished, the source file has a link count of 1\n\n" \
> "$progname"
> }

The option -z can be dropped, as we don't use it. This allows some code
cleanups.

> link_or_copy_file() {
>     from="$1"
>     to="$2"
>     if ! ln "$from" "$to" || [ -d "$to" ]; then

I don't understand the -d test here; it looks bogus.

>         cp -dp --remove-destination "$from" "$to"

I wonder how portable the --remove-destination option is. I don't think
it is needed anyway, as you always explicitly delete the destination file
before calling this function.

>     fi
> }
>
> ensure_nolinks() {
>     filename="$1"

All your variables inside functions should be declared as local.

>     link_count=$(stat -c '%h' "$filename")
>     if [ -z "$link_count" ] || [ "$link_count" -gt 1 ]; then

These double test constructs are better implemented as a single test with
-o (or -a) between the conditions.

>         dirname=$(dirname "$filename")
>         basename=$(basename "$filename")

Calling dirname and basename has a high cost. Using the bash prefix/suffix
removal constructs is two orders of magnitude faster:

    dirname="${filename%/*}"
    basename="${filename##*/}"

Believe it or not, this was one of the biggest performance improvements of
all the changes I made.

>         # Temp file name is "path/to/.file.XXXXXX"
>         tmpname=$(mktemp "${dirname}/.${basename}.XXXXXX")
>         cp -dp "$filename" "$tmpname"
>         mv "$tmpname" "$filename"
>     fi
> }
>
> process_file() {
>     file="$1"
>     backup="${OPT_PREFIX}${file}${OPT_SUFFIX}"
>
>     if [ "$OPT_WHAT" == "backup" ]; then

Using strings as switch keys isn't very efficient; it forces bash to make
string comparisons each time. It is especially bad here as two of the
values happen to begin with the same two letters. Using single (distinct)
letters should help performance.

>         if [ -e "$backup" ]; then
>             rm "$backup"

You should always call "rm -f" instead of "rm" in scripts.

>         else
>             mkdir -p "$(dirname "$backup")"
>         fi
>         if [ ! -e "$file" ]; then
>             $ECHO "New file $file"
>             touch "$backup"
>             chmod 000 "$backup"
>         else
>             $ECHO "Copying $file"
>             link_or_copy_file "$file" "$backup"
>             if [ -n "$OPT_NOLINKS" ]; then
>                 ensure_nolinks "$file"
>             fi

This construct doesn't look smart to me.
First you try to create a link to the file, and right after that you may
need to ensure that said file has no links! This operation is quite
costly, as you need to create a temporary file each time. I propose the
following variant:

    if [ -n "$OPT_NOLINKS" ]; then
        cp -dp "$file" "$backup"
        ensure_nolinks "$file"
    else
        link_or_copy_file "$file" "$backup"
    fi

This change gives a big performance boost.

>             if [ -n "OPT_TOUCH" ]; then
>                 touch "$backup"
>             fi

This option isn't supposed to be implemented for the backup function, so
I think we can remove it.

>         fi
>     elif [ "$OPT_WHAT" == "restore" ]; then
>         mkdir -p "$(dirname "$file")"
>
>         if [ ! -e "$backup" ]; then
>             return 1
>         fi
>         if [ ! -s "$backup" ]; then
>             if [ -e "$file" ]; then
>                 rm "$file"
>             fi

This test isn't really needed. In the vast majority of cases, the file
exists. If it doesn't, "rm -f" will still succeed.

>             $ECHO "Removing $file"
>             rm "$backup"

You can merge this rm command with the previous one, saving a call.

>             while [ -d "${backup%/*}" ] && ! [ -L "${backup%/*}" ]
>             do
>                 backup="${backup%/*}"
>                 rmdir --ignore-fail-on-non-empty "$backup" 2>/dev/null
>                 if [ -d "$backup" ]; then
>                     break
>                 fi
>             done

Wow. Isn't this an exact reimplementation of "rmdir -p"? We could also
drop the --ignore-fail-on-non-empty option (which might not be very
portable) if we removed the "set -e" at the beginning of the script.

>         else
>             $ECHO "Restoring $file"
>             if [ -e "$file" ]; then
>                 rm "$file"
>             fi
>             link_or_copy_file "$backup" "$file"
>             rm "$backup"

This doesn't look good. First you remove the file, then you create a link
or copy from the backup to the file, and last you remove the backup file.
Can't we just use "mv -f" here? It gives a huge performance boost.

>             while [ -d "${backup%/*}" ] && ! [ -L "${backup%/*}" ]
>             do
>                 backup="${backup%/*}"
>                 rmdir --ignore-fail-on-non-empty "$backup" 2>/dev/null
>                 if [ -d "$backup" ]; then
>                     break
>                 fi
>             done
>             if [ -n "$OPT_NOLINKS" ]; then
>                 ensure_nolinks "$file"
>             fi
>             if [ -n "$OPT_TOUCH" ]; then
>                 touch "$file"
>             fi
>         fi
>     elif [ "$OPT_WHAT" == "remove" ]; then
>         if [ -e "$backup" ]; then
>             rm "$backup"
>         fi
>         while [ -d "${backup%/*}" ] && ! [ -L "${backup%/*}" ]
>         do
>             backup="${backup%/*}"
>             rmdir --ignore-fail-on-non-empty "$backup" 2>/dev/null
>             if [ -d "$backup" ]; then
>                 break
>             fi
>         done
>     elif [ "$OPT_WHAT" == "noop" ]; then

You have a bug here: OPT_WHAT is never set to "noop", so this test can
never succeed. I guess you want [ -z "$OPT_WHAT" ] here instead.

>         if [ -e "$file" ] && [ -n "$OPT_NOLINKS" ]; then
>             ensure_nolinks "$file"
>         fi
>     else
>         return 1

This can't really happen, so why bother?

>     fi
> }
>
> walk() {
>     path="$1"
>     if [ ! -f "$path" ]; then
>         return 0
>     fi

I don't think this can happen, as you call this function on the files
"find" found.

>     if [ "${path#$OPT_PREFIX}" == "$path" ]
>     then
>         # prefix does not match
>         return 0
>     fi
>     path="${path#$OPT_PREFIX}"

You are doing the same prefix stripping twice; this could be avoided.

>     if [ -n "$OPT_SUFFIX" ] && [ "${path%$OPT_SUFFIX}" == "$path" ]
>     then
>         # suffix does not match
>         return 0
>     fi
>     path="${path%$OPT_SUFFIX}"
>
>     process_file "$path"
> }
>
> ECHO=echo
> declare -a FILELIST
> progname="$0"

$0 is a global, and you only need it once, so I don't think it's valuable
to define another variable and pass it around to functions.
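To double-check the dirname/basename replacement suggested earlier, here
is a minimal self-contained sketch (the path is made up). Note the one
edge case where the expansions and the external tools disagree: a path
with no slash at all, which backup-files never passes in:

```shell
#!/bin/sh
# Pure shell replacements for dirname(1)/basename(1), as proposed above.
# No fork, no exec: that is where the speedup comes from.
filename="patches/dir/.pc/foo.patch/file.c"

dirname_fast="${filename%/*}"    # strip shortest trailing /component
basename_fast="${filename##*/}"  # strip longest leading path/

echo "dir:  $dirname_fast"
echo "base: $basename_fast"

# Caveat: for a bare "file.c" (no slash), ${filename%/*} would yield
# "file.c" unchanged, while dirname(1) would yield "." - acceptable
# here, since the script only handles paths containing a slash.
```

This prints "dir:  patches/dir/.pc/foo.patch" and "base: file.c".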
> while [ $# -gt 0 ]; do
>     case $1 in
>     -b) OPT_WHAT=backup
>         ;;
>     -r) OPT_WHAT=restore
>         ;;
>     -x) OPT_WHAT=remove
>         ;;
>     -B) OPT_PREFIX=$2
>         shift
>         ;;
>     -f) OPT_FILE=$2
>         shift
>         ;;
>     -z) OPT_SUFFIX=$2
>         shift
>         ;;
>     -s) ECHO=:
>         ;;
>     -L) OPT_NOLINKS=1
>         ;;
>     -t) OPT_TOUCH=1
>         ;;
>     -?*) usage "$progname"
>         exit 0
>         ;;
>     *) FILELIST=($1)
>         ;;
>     esac
>
>     shift
> done
>
> if [ -z "${OPT_PREFIX}${OPT_SUFFIX}" ]; then
>     usage "$progname"
>     exit 1
> fi
> if [ ${#FILELIST[@]} == 0 ] && [ -z "$OPT_FILE" ]; then
>     usage "$progname"
>     exit 1
> fi

Both tests can be merged.

> if [ -n "$OPT_FILE" ]; then
>     cat "$OPT_FILE" \
>     | while read nextfile; do
>         process_file "$nextfile"
>     done
> fi
>
> I=0
> while [ $I -lt ${#FILELIST[@]} ]; do
>
>     case "${FILELIST[$I]}" in
>     -)
>         path="${OPT_PREFIX%/*}"
>         if ! [ -n "$path" ] && [ -d "$path" ] ; then

This test looks very bogus to me. I don't think it can succeed at all.
What was the original intent?

>             I=$(($I+1))
>             continue
>         fi
>
>         find "$path" -mindepth 1 \( -type f -o -type d \) -print 2>/dev/null \
>         | while read
>         do
>             if [ -d "$REPLY" ]
>             then
>                 if ! [ -r "$REPLY" ] || ! [ -x "$REPLY" ]
>                 then
>                     echo "$REPLY: Permission denied"
>                     exit 1
>                 fi
>             else
>                 walk "$REPLY"
>             fi
>         done

You ask "find" to list both files and directories, then for each directory
you make additional tests, while you don't actually do anything useful
with these directories. I guess the idea is to anticipate errors, but
overall it's quite expensive, as you end up calling stat once for each
file and three times for each directory. All this to handle an error
which is highly unlikely to happen in quilt, as users aren't supposed to
mess with the .pc directory contents. So I'd rather let find print errors
by itself if needed, and only have it list the regular files we are
interested in.

>         if [ $? != 0 ]; then
>             exit 1
>         fi
>         ;;
>     *)
>         process_file "${FILELIST[$I]}"
>         ;;
>     esac
>
>     I=$(($I+1))
> done

I am interested in comments on my suggestions, and even more in other
suggestions to make the script faster. One thing we could do is convert
this external script into a sourceable one (a la patchfns); this would
let us make further cleanups.

Martin, care to update your patch based on the comments you agree with?

Thanks,
-- 
Jean Delvare

_______________________________________________
Quilt-dev mailing list
Quilt-dev@nongnu.org
http://lists.nongnu.org/mailman/listinfo/quilt-dev
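P.S.: To make the find simplification proposed above concrete, here is a
rough sketch of the loop I have in mind. Everything here is illustrative:
walk() is reduced to a counting stub, and the tree lives in a throwaway
temp directory.

```shell
#!/bin/sh
# Sketch: have find list only the regular files and hand each one to the
# per-file handler, instead of also listing directories and stat'ing
# them ourselves.
dir=$(mktemp -d)
mkdir -p "$dir/.pc/patch/sub"
touch "$dir/.pc/patch/a" "$dir/.pc/patch/sub/b"

seen=0
walk() {
    seen=$((seen + 1))   # the real script would process "$1" here
}

# A temp file rather than a pipe, so walk() runs in the current shell
# (a pipe would put the loop in a subshell and lose its state).
find "$dir/.pc/patch" -mindepth 1 -type f -print > "$dir/files"
while IFS= read -r f; do
    walk "$f"
done < "$dir/files"

rm -rf "$dir"
echo "processed $seen files"
```

With the two test files created above, this prints "processed 2 files".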