Re: "clever"(?) way to find files with duplicate contents.

Christian Borntraeger Wed, 06 Jul 2016 06:47:33 -0700

On 07/06/2016 03:35 PM, John McKown wrote:
> I have a directory which has a number of files in it. I want to find out
> which files have identical content. Please, don't ask why (I'm an idiot?).
> Since these are text files, my first thought was to use diff. That is, list
> the files. For each file, do a diff against all the other files and note
> the result. I never came up with a decent algorithm to do this. Then I had
> a "vision". I remember that git stores file contents by basically creating
> a sha1sum, which it uses as a file name. Multiple files with the same
> sha1sum (which is very likely to be unique based on the content) are only
> stored once. Now, since sha1sum is very unlikely to have a collision, how
> likely would sha512sum be to have a collision? So I did the following:
>
> for i in *;do x=$(sha512sum "$i" | cut -d ' ' -f 1);echo "$i" >> "${x}.sha512sum";done
>
> I then did:
>
> wc -l *.sha512sum | head -n -1 | awk '$1 != 1 {print $2;}' | while read -r i;do echo '===';cat "$i";done
>
> which gave me a nice list of files with each group separated by ===.
>
> Is this reasonable? Is there a better way to do this?


Have you checked the "fdupes" tool?
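If fdupes is not installed, the hashing idea from the quoted mail can also be done without the intermediate *.sha512sum files. What follows is only a rough sketch, not a tested recipe: it assumes GNU coreutils (sha512sum, plus uniq with -w and --all-repeated) and a find that supports -maxdepth, and the function name find_dupes is made up for illustration. It looks only at regular files directly in the given directory, and it does not handle file names containing newlines or backslashes.

```shell
#!/bin/sh
# find_dupes DIR (hypothetical helper name): list files in DIR whose
# contents hash to the same SHA-512 digest.
# Assumes GNU coreutils and a find with -maxdepth; not recursive.
find_dupes() {
    dir=${1:-.}
    # 1. Hash every regular file directly in DIR.
    # 2. Sort so that identical digests end up on adjacent lines.
    # 3. Compare only the first 128 characters (the hex digest) and print
    #    every line whose digest repeats, with a blank line between groups.
    find "$dir" -maxdepth 1 -type f -exec sha512sum {} + |
        sort |
        uniq -w 128 --all-repeated=separate
}
```

Calling find_dupes . in the directory in question should print one block per set of identical files, each line being the digest followed by the file name. Files with unique content are not listed.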

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
----------------------------------------------------------------------
For more information on Linux on System z, visit
http://wiki.linuxvm.org/
