>Take disk files as an example. Hashing files (ignoring the name)
>would be a saner way to discover whether you have duplicate files on your
>disk than to compare every file with every other.

I actually played with this many years ago when I wrote a utility to traverse a UNIX file system looking for duplicate files and to link the duplicates together to recover disk space.

My first program produced an MD5 hash of every file and sorted the hash results, just as you suggest. That worked, but I later came up with a better (faster) algorithm that doesn't involve hashing. I simply quicksort the list of files using a comparison function that, as a side effect, links duplicate files together. Then I rescan the sorted list, comparing adjacent entries to mop up any remaining duplicates. The vast majority of duplicates are found and deleted by the comparisons in the quicksort phase; relatively few are found by the subsequent re-scan. I also applied the obvious performance enhancements, such as caching the results of fstat() operations in memory.

This algorithm was considerably faster than hashing because a typical filesystem has many files with unique sizes, and some of these files can be quite big. Once the comparison function discovers that two files have different sizes, there's no need to go any further. You don't need to hash them or even compare them. This is a big win when you have lots of large files.

Phil

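P.S. For anyone who wants to experiment, here is a minimal C sketch of the size-first comparison idea, not the original utility: it takes file names from the command line, calls stat() once per file instead of caching fstat() results, and only reports duplicates rather than linking or deleting them, so everything is found in the adjacency re-scan instead of inside the comparator. The names (compare_entries, compare_contents) are just illustrative.

    /* Sketch: sort files so that duplicates end up adjacent, comparing
     * cached sizes first and reading contents only when sizes match. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    struct entry {
        const char *path;
        off_t size;              /* cached from stat(), one call per file */
    };

    /* Compare file contents byte by byte; returns <0, 0, >0 like memcmp. */
    static int compare_contents(const struct entry *a, const struct entry *b)
    {
        FILE *fa = fopen(a->path, "rb");
        FILE *fb = fopen(b->path, "rb");
        int ca, cb;

        if (!fa || !fb) {
            perror("fopen");
            exit(EXIT_FAILURE);  /* keep the sketch simple: bail on I/O errors */
        }
        do {
            ca = getc(fa);
            cb = getc(fb);
        } while (ca == cb && ca != EOF);
        fclose(fa);
        fclose(fb);
        return ca - cb;
    }

    /* qsort comparator: files with different sizes never need a content read. */
    static int compare_entries(const void *pa, const void *pb)
    {
        const struct entry *a = pa, *b = pb;

        if (a->size != b->size)
            return (a->size < b->size) ? -1 : 1;
        return compare_contents(a, b);
    }

    int main(int argc, char **argv)
    {
        struct entry *files;
        size_t n = 0;

        if (argc < 2) {
            fprintf(stderr, "usage: %s file...\n", argv[0]);
            return 1;
        }
        files = calloc(argc - 1, sizeof *files);
        if (!files)
            return 1;

        for (int i = 1; i < argc; i++) {
            struct stat st;
            if (stat(argv[i], &st) == 0 && S_ISREG(st.st_mode)) {
                files[n].path = argv[i];
                files[n].size = st.st_size;   /* cache the size up front */
                n++;
            }
        }

        /* Sort by (size, contents); identical files become adjacent. */
        qsort(files, n, sizeof *files, compare_entries);

        /* Re-scan adjacent entries and report the duplicates. */
        for (size_t i = 1; i < n; i++) {
            if (files[i].size == files[i - 1].size &&
                compare_contents(&files[i], &files[i - 1]) == 0)
                printf("duplicate: %s == %s\n", files[i].path, files[i - 1].path);
        }
        free(files);
        return 0;
    }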