Cool stuff - I really should go to bed now, but I'll look at this further in the morning.
By the way, in response to whoever suggested pre-sorting files: I sort of do this (in the old Ruby version), but actually the program is mostly looking for duplicate *directories* of files - the goal is to point it at my archive disk, and have it find the biggest identical subdirectories. Duplicate file checking is needed for this, but it's only a tiny part.

And I'm playing with sketching algorithms at work right now, which look very handy for the next phase, which is to find the biggest *similar* subdirectories.

That's the real goal - point a program at a terabyte archive disk, and have it spit out:
"/archive/old_disks/laptop_2007a is 312 GB and 99% similar to /archive/misc/stuff_from_2007"
... or sorting by file count:
"/archive/source/old_projects/c_stuff/1996 is 20,324 files and 97% similar to /archive/old/disks/laptop2006/unsorted/old_drives/old_archive/c_cpp_stuff/90s"

- Korny (who may have made some bad decisions in the past about how to archive files. Many times. :)

On Thu, May 28, 2009 at 7:06 PM, Timothy Pratley <timothyprat...@gmail.com> wrote:
>
> Thanks for the tip about lazy = Mikio!
>
> Wow Daniel, a very thorough description there - it seems file systems
> are close to your heart :) I've taken your design and implemented a
> variant on it:
>
> http://groups.google.com/group/clojure/web/find-duplicates.clj
>
> For this sort of domain I think memorizing is possibly a bad idea -
> what if the file gets replaced - kind of defeats the purpose?
> This solution only checks the minimal amount of information which is
> great. But actually I suspect a more useful application would like to
> check file additions also. We can avoid even more work if we are
> willing to retain a uniqueness tree, however this is again susceptible
> to the files changing underneath and would only work if all additions
> and removals were controlled.
>
> In retrospect, reduce-by pretty much mirrors group-by... I should have
> looked at that first!
>
>
> Regards,
> Tim.

--
Kornelis Sietsma  korny at my surname dot com
"Every jumbled pile of person has a thinking part
that wonders what the part that isn't thinking
isn't thinking of"