Bug#656142: ITP: duff -- Duplicate file finder
> On Mon, Jan 16, 2012 at 12:58:13PM -0800, Kamal Mostafa wrote:
> > * Package name    : duff
> > * URL             : http://duff.sourceforge.net/

On Tue, 2012-01-17 at 09:56 +0100, Simon Josefsson wrote:
> If there aren't warnings about use of SHA1 in the tool, there should
> be. While I don't recall any published SHA1 collisions, SHA1 is
> considered broken and shouldn't be used if you want to trust your
> comparisons. I'm assuming the tool supports SHA256 and other SHA2
> hashes as well? It might be useful to make sure the defaults are
> non-SHA1.

Duff supports SHA1, SHA256, SHA384 and SHA512 hashes. The default is
SHA1. For comparison, rdfind supports only MD5 and SHA1 hashes.

Thanks for the note Simon -- I'll bring it to the attention of the
upstream author, Camilla Berglund.

On Tue, 2012-01-17 at 09:12 +, Lars Wirzenius wrote:
> rdfind seems to be quickest one, but duff compares well with hardlink,
> which (see http://liw.fi/dupfiles/) was the fastest one I knew of in
> Debian so far.
>
> This was done using my benchmark-cmd utility in my extrautils
> collection (not in Debian): http://liw.fi/extrautils/ for source.

Thanks for the pointer to your benchmark-cmd tool, Lars. Very handy!
My results with it mirrored yours -- of the similar tools, duff appears
to lag only rdfind in performance (for my particular dataset, at least).

I looked into duff's methods a bit and discovered a few easy performance
optimizations that may speed it up a bit more. The author is reviewing
my proposed patch now, and seems very open to collaboration.

> Personally, I would be wary of using checksums for file comparisons,
> since comparing files byte-by-byte isn't slow (you only need to
> do it to files that are identical in size, and you need to read
> all the files anyway).

Byte-by-byte might well be slower than checksums if you end up faced
with N>2 very large (uncacheable) files of identical size but unique
contents. They all need to be checked against each other, so each of
the N files would need to be read N-1 times. Anyway, duff actually
*does* offer byte-by-byte comparison as an option (rdfind does not).

> I also think we've now got enough of duplicate file finders in
> Debian that it's time to consider whether we need so many. It's
> too bad they all have incompatible command line syntaxes, or it
> would be possible to drop some. (We should accept a new one if
> it is better than the existing ones, of course. Evidence required.)

To me, the premise that a new package must be better than existing
similar ones ("evidence required", no less) seems pretty questionable.
It may not be so easy to establish just what "better than" means, and
it puts us in a position of making value judgments for our users that
they should be able to make for themselves.

While I do think it is productive to compare the performance of these
similar tools to each other, I don't see much value in pitting them
against each other in benchmark wars as a criterion for acceptance into
Debian.

Here we have a good quality DFSG-compliant package with an active
upstream and a willing DD maintainer. While similar tools do exist
already in Debian, they do not offer identical feature sets or user
interfaces, and only one of them has been shown to outperform duff in
quick spot checks. Some users have expressed a preference for duff over
the others. Does that make it "better than the existing ones"?

My answer: Who cares? Nobody is making us choose only one.

In my view, it's not really a problem if we carry multiple duplicate
file detectors in Debian, and we will best serve our users by letting
them choose their preferred tool for the job. And by allowing such
packages into Debian we encourage their improvement, to everyone's
benefit.

 -Kamal
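The read-count argument above can be made concrete with a little arithmetic. The following is only a rough sketch in Python, assuming the naive strategy in which every pair of same-size files is compared in full, with no early exit and no help from the page cache. The function names are hypothetical and are not taken from duff or any other tool in this thread.

```python
def bytes_read_pairwise(n_files: int, size: int) -> int:
    """Worst case for naive pairwise byte-by-byte comparison.

    Each of the n files is read once per comparison it takes part in,
    that is n - 1 times (assumes no early exit and no page cache).
    """
    return n_files * (n_files - 1) * size


def bytes_read_checksum(n_files: int, size: int) -> int:
    """Checksum approach: every file is read exactly once to be hashed."""
    return n_files * size


if __name__ == "__main__":
    n, size = 4, 10 * 2**30  # four files of 10 GiB each, identical size
    print(bytes_read_pairwise(n, size) // 2**30, "GiB read pairwise")
    print(bytes_read_checksum(n, size) // 2**30, "GiB read with checksums")
```

This is a worst-case bound only. A byte-by-byte comparison can usually stop at the first differing block, so files with unique contents are often told apart long before they have been read in full.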
Bug#656142: ITP: duff -- Duplicate file finder
On Mon, Jan 16, 2012 at 12:58:13PM -0800, Kamal Mostafa wrote:
> * Package name    : duff
> * URL             : http://duff.sourceforge.net/

A quick speed comparison:

      real  user  system  max RSS  elapsed  cmd
       (s)   (s)     (s)    (KiB)      (s)
       3.2   2.4     5.8    62784      5.8  hardlink --dry-run files > /dev/null
       1.1   0.4     1.6    15424      1.6  rdfind files > /dev/null
       1.9   0.2     2.2     9904      2.2  duff-0.5/src/duff -r files > /dev/null

rdfind seems to be quickest one, but duff compares well with hardlink,
which (see http://liw.fi/dupfiles/) was the fastest one I knew of in
Debian so far.

This was done using my benchmark-cmd utility in my extrautils
collection (not in Debian): http://liw.fi/extrautils/ for source.
The exact command to generate the above table:

    benchmark-cmd \
        --setup='genbackupdata --create=100m files' \
        --setup='cp -a files/0 files/copy' \
        --cleanup='rm -rf files' \
        --verbose \
        --command='hardlink --dry-run files > /dev/null' \
        --command='rdfind files > /dev/null' \
        --command='duff-0.5/src/duff -r files > /dev/null'

Personally, I would be wary of using checksums for file comparisons,
since comparing files byte-by-byte isn't slow (you only need to
do it to files that are identical in size, and you need to read
all the files anyway).

I also think we've now got enough of duplicate file finders in
Debian that it's time to consider whether we need so many. It's
too bad they all have incompatible command line syntaxes, or it
would be possible to drop some. (We should accept a new one if
it is better than the existing ones, of course. Evidence required.)

-- 
Freedom-based blog/wiki/web hosting: http://www.branchable.com/
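The approach described in the message above (compare byte-by-byte, but only among files of identical size) might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the method of hardlink, rdfind or duff. The name `find_duplicates` is hypothetical, and symlinks are skipped as a simplification.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(root: str) -> list[list[str]]:
    """Return groups of files under root that have identical contents."""
    # Step 1: bucket regular files by size; files of different sizes
    # can never be duplicates, so they are never compared.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Step 2: within each size bucket, compare contents byte-by-byte.
    groups = []
    for paths in by_size.values():
        clusters = []  # each cluster holds files proven identical
        for path in sorted(paths):
            for cluster in clusters:
                # shallow=False forces a real content comparison
                if filecmp.cmp(cluster[0], path, shallow=False):
                    cluster.append(path)
                    break
            else:
                clusters.append([path])
        groups.extend(c for c in clusters if len(c) > 1)
    return groups
```

Each new file is compared against one representative per cluster, so a bucket of files that are all identical needs only one comparison per file. A bucket of same-size files that all differ is the expensive case.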
Bug#656142: ITP: duff -- Duplicate file finder
Kamal Mostafa writes:
> Package: wnpp
> Severity: wishlist
> Owner: Kamal Mostafa
>
> * Package name    : duff
>   Version         : 0.5
>   Upstream Author : Camilla Berglund
> * URL             : http://duff.sourceforge.net/
> * License         : Zlib
>   Programming Lang: C
>   Description     : Duplicate file finder
>
> Duff is a command-line utility for identifying duplicates in a given set of
> files. It attempts to be usably fast and uses the SHA family of message
> digests as a part of the comparisons.

If there aren't warnings about use of SHA1 in the tool, there should
be. While I don't recall any published SHA1 collisions, SHA1 is
considered broken and shouldn't be used if you want to trust your
comparisons. I'm assuming the tool supports SHA256 and other SHA2
hashes as well? It might be useful to make sure the defaults are
non-SHA1.

/Simon
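As an illustration of the suggestion above, a digest helper with a non-SHA1 default could be sketched as follows. This is Python rather than duff's C, and `hash_file` with its parameters is a hypothetical name chosen for the example. It shows only the idea of a selectable algorithm that defaults to SHA256, not how duff does it.

```python
import hashlib


def hash_file(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files need little memory.

    The default is SHA256; callers must opt in explicitly to use
    a weaker digest such as "sha1".
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with different digests are certainly different. Equal digests only make equality very likely, which is why a final byte-by-byte check is a reasonable safeguard.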
Bug#656142: ITP: duff -- Duplicate file finder
thus spoke Kamal Mostafa [2012.01.17.0049 +0100]:
> In my humble opinion, that would be an unreasonable pre-condition for
> inclusion in Debian. Our standard for inclusion should not be that a
> new package must be "vastly better" than other similar packages. That
> would deny a new package the opportunity to build a user base and
> possibly someday evolve to become the "vastly better" alternative
> itself.

Right, but I'd say it needs to be better and the maintainer needs to
be able to argue how it is better.

-- 
 .''`.   martin f. krafft                     Related projects:
: :'  :  proud Debian developer               http://debiansystem.info
`. `'`   http://people.debian.org/~madduck    http://vcs-pkg.org
  `-  Debian - when you have better things to do than fixing systems

"the time for petty politics is over. the very next century will bring
 the fight for dominion over the earth."
                                                 - friedrich nietzsche
Bug#656142: ITP: duff -- Duplicate file finder
On Mon, 2012-01-16 at 23:07 +0100, Joerg Jaspert wrote:
> >> What is it the benefit over fdupes, rdfind, ...?
> > ..., hardlink, ...
> finddup from perforate

After a quick evaluation of the various "find dupe files" tools, I was
attracted to try duff because:

 1. It looked easier to use than the others.

 2. This quote from its website[1] was exactly what I was looking for:
    "Note that duff itself never modifies any files, but it's designed
    to play nice with tools that do."

The other dupe cleaner utilities left me worried that they might trash
something important if I got my command line options wrong or forgot a
--dry-run flag.

> > Was thinking about packaging it myself already, so I may also sponsor
> > Kamal's package when it's ready.

Thanks Axel, but I'm a DD myself, so won't need a sponsor.

> You just listed the third duplicate (and me no. 4), and still go blind
> right on "ohoh, i sponsor it". Why? I hope its conditional on it being
> vastly better than any of the others (speed, functionality, ...)

In my humble opinion, that would be an unreasonable pre-condition for
inclusion in Debian. Our standard for inclusion should not be that a
new package must be "vastly better" than other similar packages. That
would deny a new package the opportunity to build a user base and
possibly someday evolve to become the "vastly better" alternative
itself.

 -Kamal
ka...@whence.com
ka...@debian.org

[1] http://duff.sourceforge.net/
Bug#656142: ITP: duff -- Duplicate file finder
>> What is it the benefit over fdupes, rdfind, ...?
> ..., hardlink, ...

finddup from perforate

> Was thinking about packaging it myself already, so I may also sponsor
> Kamal's package when it's ready.

You just listed the third duplicate (and me no. 4), and still go blind
right on "ohoh, i sponsor it". Why? I hope its conditional on it being
vastly better than any of the others (speed, functionality, ...) and not
just "because".

Contrary to some common belief, Debian is not the dump for NIH, and even
if a little redundancy can't hurt, too much is just waste. Of our time,
of our mirrors (space and bandwidth), ...

-- 
bye, Joerg
Contrary to common belief, Arch:i386 is *not* the same as Arch: any.
Bug#656142: ITP: duff -- Duplicate file finder
Hi,

Samuel Thibault wrote:
> > * Package name    : duff
> >   Version         : 0.5
> >   Upstream Author : Camilla Berglund
> > * URL             : http://duff.sourceforge.net/
> > * License         : Zlib
> >   Programming Lang: C
> >   Description     : Duplicate file finder
> >
> > Duff is a command-line utility for identifying duplicates in a given set of
> > files. It attempts to be usably fast and uses the SHA family of message
> > digests as a part of the comparisons.
>
> What is it the benefit over fdupes, rdfind, ...?

..., hardlink, ...

Some of my coworkers prefer duff over the tools available in Debian,
too. I'm no longer sure why, though; it's possible that speed was one
argument, because they ran it over several TB of data. I will check
what exactly the reason was back then.

Was thinking about packaging it myself already, so I may also sponsor
Kamal's package when it's ready.

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert, http://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE
  `-    |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
Bug#656142: ITP: duff -- Duplicate file finder
Kamal Mostafa wrote on Mon 16 Jan 2012 12:58:13 -0800:
> Package: wnpp
> Severity: wishlist
> Owner: Kamal Mostafa
>
> * Package name    : duff
>   Version         : 0.5
>   Upstream Author : Camilla Berglund
> * URL             : http://duff.sourceforge.net/
> * License         : Zlib
>   Programming Lang: C
>   Description     : Duplicate file finder
>
> Duff is a command-line utility for identifying duplicates in a given set of
> files. It attempts to be usably fast and uses the SHA family of message
> digests as a part of the comparisons.

What is it the benefit over fdupes, rdfind, ...?

Samuel
Bug#656142: ITP: duff -- Duplicate file finder
Package: wnpp
Severity: wishlist
Owner: Kamal Mostafa

* Package name    : duff
  Version         : 0.5
  Upstream Author : Camilla Berglund
* URL             : http://duff.sourceforge.net/
* License         : Zlib
  Programming Lang: C
  Description     : Duplicate file finder

Duff is a command-line utility for identifying duplicates in a given set of
files. It attempts to be usably fast and uses the SHA family of message
digests as a part of the comparisons.