Bug#314548: perforate: suggestions for speeding up the program

Amaya Fri, 11 Nov 2005 01:05:17 -0800

James: Hi there, 

After yesterday's upload of perforate, I was wondering whether I could
close, or maybe retitle, this bug.


That upload included a patch by Kari which, in his own words,
implemented several of your suggestions:

<@Amaya> kaol:  do you think i may close #314548 now? I just uploaded
perforate with your patch in it.
<@kaol> Amaya: I did only part of what was in #314548. There was also a
suggestion to do diff -q if there were only two files with the same size
and not compute full md5 sums
<@kaol> also, he asked for an option to do diff -q to for all
comparisons, to avoid any md5 hash collisions (unlikely as they may be)
<@kaol> I just skipped calculating md5 sums for those files that are
alone in being of their size
<@kaol> which may help performance quite a bit already
<@kaol> I didn't feel like rewriting the logic enough to put all that
stuff in there
* kaol waves
                                
So I would like you to test the speed improvement and see if we can
consider some of these issues as dealt with:

>  - The initial scan of the directory tree should record only file
>  size, ownership, and permissions.  If there is only one file of a
>  given length, then it has no duplicates and need not be read.

This has been dealt with :)
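
For reference, the size-first scan quoted above could look roughly like
the following. This is only a Python sketch of the idea, not the code in
perforate; the function name candidates_by_size is made up for
illustration.

```python
import os
import stat
from collections import defaultdict


def candidates_by_size(root):
    """Hypothetical helper: group regular files under root by size.

    Sizes that occur only once are dropped, because a file that is alone
    in its size cannot have a duplicate and never needs to be read.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode):
                by_size[st.st_size].append(path)
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}
```

Only the paths returned here would go on to checksumming or comparison.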

>  - Optionally, ignore small files (where linking would not save much
>  space anyway)

Not dealt with AFAIK. 

>  - Per bug #263782, if files of a given size all differ in ownership
>  and/or permissions, they need not be read.  (Actually I would like
>  the option of ignoring ownership and permissions.  In a read-only
>  backup it might be okay to link files with differing ownership and
>  permissions.)

Dealt with.
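
The grouping described in that suggestion can be sketched as below,
including the requested option to ignore ownership and permissions. This
is an illustrative Python sketch under my own assumptions, not the
patched code; group_key, groups_worth_reading and the
ignore_owner_and_mode parameter are hypothetical names.

```python
import os
import stat
from collections import defaultdict


def group_key(path, ignore_owner_and_mode=False):
    """Hypothetical key: files with different keys are never compared."""
    st = os.lstat(path)
    if ignore_owner_and_mode:
        # Read-only backup case: size alone decides who gets compared.
        return (st.st_size,)
    return (st.st_size, st.st_uid, st.st_gid, stat.S_IMODE(st.st_mode))


def groups_worth_reading(paths, ignore_owner_and_mode=False):
    """Return only the groups that contain at least two files."""
    groups = defaultdict(list)
    for p in paths:
        groups[group_key(p, ignore_owner_and_mode)].append(p)
    return [g for g in groups.values() if len(g) > 1]
```

Files of equal size whose ownership or permissions all differ end up in
groups of one and are therefore never opened.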

>  - If there are exactly two files of a given size, ownership, and
>  permissions, then use "diff -q" to compare them (so the two files are
>  read only up to the first difference).

Not dealt with AFAIK.
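
The point of "diff -q" here is that both files are read only up to the
first difference. A rough Python equivalent of that comparison, as a
sketch only (same_content and its chunk_size default are assumptions of
mine):

```python
def same_content(path_a, path_b, chunk_size=65536):
    """Hypothetical helper: compare two files chunk by chunk.

    Returns False as soon as a chunk differs, so the rest of both files
    is not read; returns True only when both files end together.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files reached EOF at the same point
                return True
```

For a pair of files this replaces two full md5 computations with at most
one pass over each file.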

>  - If there are more than two files of a given size, then use md5sum
>  to identify probable duplicate files.  To guard against false
>  matches, I would advocate checking with "diff -q".  In that case, I
>  suggest calculating the md5sum only of the leading part of the file
>  (say, the first 4096 bytes).  If differences are found in the leading
>  parts of the files then of course the remainders of the files need
>  not be read.  Other users might not insist on the "diff -q" check, in
>  which case the md5sum should of course be calculated on the entire
>  file.

Not dealt with AFAIK.
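
To make the last suggestion concrete: checksum only the leading 4096
bytes to find probable duplicates, then confirm with a full comparison.
The sketch below is in Python and is not taken from perforate; the
function names are hypothetical, and filecmp.cmp(..., shallow=False)
stands in for "diff -q".

```python
import filecmp
import hashlib
from collections import defaultdict


def prefix_digest(path, prefix_len=4096):
    """md5 of the leading prefix_len bytes only (4096 as in the suggestion)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(prefix_len)).hexdigest()


def probable_duplicates(paths, prefix_len=4096):
    """Bucket same-size files by the md5 of their leading bytes."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[prefix_digest(p, prefix_len)].append(p)
    return [bucket for bucket in buckets.values() if len(bucket) > 1]


def confirmed_duplicates(paths, prefix_len=4096):
    """Confirm each bucket with a byte-by-byte comparison."""
    groups = []
    for bucket in probable_duplicates(paths, prefix_len):
        classes = []  # each class holds files with identical content
        for p in bucket:
            for cls in classes:
                if filecmp.cmp(cls[0], p, shallow=False):
                    cls.append(p)
                    break
            else:
                classes.append([p])
        groups.extend(cls for cls in classes if len(cls) > 1)
    return groups
```

Files that differ within the first 4096 bytes are separated after a
single short read; only files with matching prefixes are read further.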

So maybe this could now be retitled 'Include "diff -q" tests to speed
up and complete md5sum calculations'.

Let me know what you think, and send along any other suggestions you
might come up with.

-- 
 .''`.       Follow the white Rabbit - Ranty (and Lewis Carroll)
: :' :           
`. `'           Proudly running unstable Debian GNU/Linux
  `-     www.amayita.com  www.malapecora.com  www.chicasduras.com

