James: Hi there,
After yesterday's upload of perforate, I was wondering whether I could
close, or maybe retitle, this bug.
This last upload included a patch by Kari which, in his own words,
implemented some of your suggestions:
<@Amaya> kaol: do you think i may close #314548 now? I just uploaded
perforate with your patch in it.
<@kaol> Amaya: I did only part of what was in #314548. There was also a
suggestion to do diff -q if there were only two files with the same size
and not compute full md5 sums
<@kaol> also, he asked for an option to do diff -q for all
comparisons, to avoid any md5 hash collisions (unlikely as they may be)
<@kaol> I just skipped calculating md5 sums for those files that are
alone in being of their size
<@kaol> which may help performance quite a bit already
<@kaol> I didn't feel like rewriting the logic enough to put all that
stuff in there
* kaol waves
So I would like you to test the speed improvement and see if we can
consider some of these issues as dealt with:
> - The initial scan of the directory tree should record only file
> size, ownership, and permissions. If there is only one file of a
> given length, then it has no duplicates and need not be read.
This has been dealt with :)
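
For reference, the idea in this item can be sketched roughly as below.
This is only an illustration in Python, not the code in Kari's patch,
and the function name candidates_by_size is made up for the example:

```python
import os
import stat
from collections import defaultdict

def candidates_by_size(root):
    """Group regular files under root by size, using only stat data.

    Sizes that occur once are dropped: such a file cannot have a
    duplicate, so its contents never need to be read.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode):
                by_size[st.st_size].append(path)
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}
```

Only the files left in the returned groups would go on to the md5sum
step.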
> - Optionally, ignore small files (where linking would not save much
> space anyway)
Not dealt with AFAIK.
> - Per bug #263782, if files of a given size all differ in ownership
> and/or permissions, they need not be read. (Actually I would like
> the option of ignoring ownership and permissions. In a read-only
> backup it might be okay to link files with differing ownership and
> permissions.)
Dealt with.
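
To make the request concrete, here is a rough Python sketch of a
grouping key that takes ownership and permissions into account, with a
switch to ignore them. The name grouping_key and the ignore_owner_perms
parameter are hypothetical and do not correspond to any existing
perforate option:

```python
import os
import stat

def grouping_key(path, ignore_owner_perms=False):
    """Return the key under which a file is grouped before any reading.

    By default, files are only candidates for linking if size, owner,
    group, and permission bits all match. With ignore_owner_perms=True
    (an assumed option, e.g. for read-only backups) only size counts.
    """
    st = os.lstat(path)
    if ignore_owner_perms:
        return (st.st_size,)
    return (st.st_size, st.st_uid, st.st_gid, stat.S_IMODE(st.st_mode))
```

Files whose keys are unique need not be read at all.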
> - If there are exactly two files of a given size, ownership, and
> permissions, then use "diff -q" to compare them (so the two files are
> read only up to the first difference).
Not dealt with AFAIK.
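
If someone wants to pick this up, the behaviour asked for (stop
reading at the first difference, as "diff -q" does) could look roughly
like this in Python. It is a sketch only; same_content and chunk_size
are names chosen for the example:

```python
def same_content(path_a, path_b, chunk_size=65536):
    """Compare two files chunk by chunk, stopping at the first mismatch."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False  # first differing chunk: stop reading here
            if not a:
                return True   # both files ended with no difference found
```

For a pair of equal-sized files this reads each file at most once, and
much less when they differ early, whereas two md5sums always read both
files completely.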
> - If there are more than two files of a given size, then use md5sum
> to identify probable duplicate files. To guard against false
> matches, I would advocate checking with "diff -q". In that case, I
> suggest calculating the md5sum only of the leading part of the file
> (say, the first 4096 bytes). If differences are found in the leading
> parts of the files then of course the remainders of the files need
> not be read. Other users might not insist on the "diff -q" check, in
> which case the md5sum should of course be calculated on the entire
> file.
Not dealt with AFAIK.
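
As I understand this suggestion, it is a two-stage check: a cheap md5
over the leading 4096 bytes to rule out most non-duplicates, then a
full comparison to confirm the rest. A rough Python sketch follows. The
function names are invented for the example, and filecmp.cmp from the
Python standard library stands in for "diff -q":

```python
import filecmp
import hashlib
from collections import defaultdict

def leading_md5(path, length=4096):
    """md5 of only the first `length` bytes of the file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(length)).hexdigest()

def group_by_leading_md5(paths, length=4096):
    """Keep only groups of files whose leading parts hash identically."""
    groups = defaultdict(list)
    for p in paths:
        groups[leading_md5(p, length)].append(p)
    return [g for g in groups.values() if len(g) > 1]

def duplicate_sets(paths, length=4096):
    """Confirm probable duplicates with a full byte-for-byte comparison."""
    result = []
    for group in group_by_leading_md5(paths, length):
        clusters = []  # each cluster holds files with identical content
        for p in group:
            for cluster in clusters:
                if filecmp.cmp(cluster[0], p, shallow=False):
                    cluster.append(p)
                    break
            else:
                clusters.append([p])
        result.extend(c for c in clusters if len(c) > 1)
    return result
```

Users who do not want the confirming comparison would instead need the
md5sum of the entire file, as the item says.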
So maybe this could now be retitled as 'Include "diff -q" tests to speed
up and complete md5sum calculations'.
Let me know what you think, and whether you have any other suggestions.
--
.''`. Follow the white Rabbit - Ranty (and Lewis Carroll)
: :' :
`. `' Proudly running unstable Debian GNU/Linux
`- www.amayita.com www.malapecora.com www.chicasduras.com