I have a directory with a number of files in it, and I want to find out
which files have identical content. Please don't ask why (I'm an idiot?).
Since these are text files, my first thought was to use diff. That is, list
the files and, for each file, do a diff against all the other files and
note the result. I never came up with a decent algorithm to do this. Then I
had a "vision". I remembered that git stores file contents by basically
creating a sha1sum of the content, which it uses as a file name. Multiple
files with the same sha1sum (which is very likely to be unique to the
content) are stored only once. Now, since sha1sum is very unlikely to have
a collision, how likely would sha512sum be to have one? So I did the
following:

for i in *; do
  x=$(sha512sum "$i" | cut -d ' ' -f 1)
  echo "$i" >>"${x}.sha512sum"
done

I then did:

wc -l *.sha512sum | head -n -1 | awk '$1 != 1 {print $2;}' |
  while read -r i; do echo '==='; cat "$i"; done

which gave me a nice list of files with each group separated by ===.
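
For comparison, here is a rough sketch of the same hash-grouping idea that
writes no helper files, so a second run does not end up hashing its own
*.sha512sum output. The function name find_dups is made up for
illustration, and the sketch assumes GNU sha512sum and file names without
newlines or backslashes. It has only been reasoned through, not run against
a real directory tree.

```shell
# Sketch: group regular files in the current directory by SHA-512 digest.
# find_dups is a hypothetical name; assumes GNU coreutils sha512sum.
find_dups() {
    for f in *; do
        # Skip directories and other non-regular entries.
        if [ -f "$f" ]; then
            sha512sum -- "$f"       # prints "<128 hex digits>  <name>"
        fi
    done |
    LC_ALL=C sort |                 # identical digests become adjacent lines
    awk '
        {
            digest = $1
            name = substr($0, 131)  # skip 128 hex digits plus 2 separators
            if (digest == prev) {
                # First repeat of a digest: open a group, print its first file.
                if (!shown) { print "==="; print prevname; shown = 1 }
                print name
            } else {
                shown = 0
            }
            prev = digest
            prevname = name
        }'
}
```

Running find_dups in the directory should print each group of identical
files under its own === line, and nothing at all for files whose content is
unique.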

Is this reasonable? Is there a better way to do this?

--
"Pessimism is an admirable quality in an engineer. Pessimistic people check
their work three times, because they're sure that something won't be right.
Optimistic people check once, trust in Solis-de to keep the ship safe, then
blow everyone up."
"I think you're mistaking the word optimistic for inept."
"They've got a similar ring to my ear."

From "Star Nomad" by Lindsay Buroker:

Maranatha! <><
John McKown
