Cool stuff - I really should go to bed now, but I'll look at this
further in the morning.

By the way, in response to whoever suggested pre-sorting files: I
sort of do this (in the old Ruby version), but mostly the program is
looking for duplicate *directories* of files. The goal is to point it
at my archive disk and have it find the biggest identical
subdirectories.  Duplicate file checking is needed for this, but it's
only a tiny part.
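
A minimal sketch of the duplicate-directory idea, in Python purely for
illustration (it is not the Ruby or Clojure code discussed in this
thread). It assumes "identical" means same file names and same
contents all the way down, uses SHA-1 as the content hash, and skips
symlinks; the names file_digest, dir_digest and duplicate_dirs are
hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-1 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dir_digest(path, index):
    """Return (digest, total_bytes) for the tree rooted at path.

    A directory's digest is a hash over its sorted (kind, name, digest)
    entries, so two trees with equal names and contents hash the same.
    Every directory visited is recorded in index[digest].
    """
    h = hashlib.sha1()
    total = 0
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        if os.path.islink(child):
            continue  # assumption: symlinks are ignored
        if os.path.isdir(child):
            kind, (digest, size) = b"d", dir_digest(child, index)
        elif os.path.isfile(child):
            kind, digest, size = b"f", file_digest(child), os.path.getsize(child)
        else:
            continue  # sockets, devices, etc. are ignored
        h.update(kind + b"\0" + os.fsencode(name) + b"\0" + digest.encode() + b"\n")
        total += size
    digest = h.hexdigest()
    index[digest].append((path, total))
    return digest, total


def duplicate_dirs(root):
    """List (total_bytes, [paths]) for directories sharing a digest, biggest first."""
    index = defaultdict(list)
    dir_digest(root, index)
    groups = [
        (entries[0][1], sorted(p for p, _ in entries))
        for entries in index.values()
        if len(entries) > 1
    ]
    return sorted(groups, key=lambda g: g[0], reverse=True)
```

Note that if two trees match, all of their matching subtrees are
reported too, and every empty directory lands in one zero-byte group;
a real tool would probably filter both.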

And I'm playing with sketching algorithms at work right now, which
look very handy for the next phase: finding the biggest *similar*
subdirectories.  That's the real goal - point a program at a terabyte
archive disk, and have it spit out:
"/archive/old_disks/laptop_2007a is 312 GB and 99% similar to
/archive/misc/stuff_from_2007"
... or, sorting by file count:
"/archive/source/old_projects/c_stuff/1996 is 20,324 files and 97%
similar to
/archive/old/disks/laptop2006/unsorted/old_drives/old_archive/c_cpp_stuff/90s"
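
One common sketching technique that fits this kind of "N% similar"
report is MinHash, which estimates the Jaccard similarity of two sets
from short signatures. Whether that is the algorithm meant above is an
assumption. The sketch below (Python, for illustration only) treats
each directory as a set of byte strings, for example file content
hashes; minhash and estimated_similarity are hypothetical names.

```python
import hashlib


def minhash(items, num_hashes=128):
    """MinHash signature of a non-empty set of byte strings.

    For each of num_hashes salted hash functions, keep the smallest
    64-bit hash value seen over all items.
    """
    if not items:
        raise ValueError("cannot sketch an empty set")
    signature = []
    for i in range(num_hashes):
        salt = i.to_bytes(4, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(salt + item, digest_size=8).digest(), "big")
            for item in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions; an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The signatures have a fixed size no matter how many files a directory
holds, so comparing two directories costs the same whether they
contain ten files or twenty thousand. The estimate is noisy; more hash
functions give a tighter estimate at the cost of more hashing.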

- Korny (who may have made some bad decisions in the past about how to
archive files. Many times. :)

On Thu, May 28, 2009 at 7:06 PM, Timothy Pratley
<timothyprat...@gmail.com> wrote:
>
> Thanks for the tip about lazy = Mikio!
>
> Wow Daniel, a very thorough description there - it seems file systems
> are close to your heart :) I've taken your design and implemented a
> variant on it:
>
>    http://groups.google.com/group/clojure/web/find-duplicates.clj
>
> For this sort of domain I think memoizing is possibly a bad idea -
> what if the file gets replaced? That kind of defeats the purpose.
> This solution only checks the minimal amount of information which is
> great. But actually I suspect a more useful application would like to
> check file additions also. We can avoid even more work if we are
> willing to retain a uniqueness tree, however this is again susceptible
> to the files changing underneath and would only work if all additions
> and removals were controlled.
>
> In retrospect, reduce-by pretty much mirrors group-by... I should have
> looked at that first!
>
>
> Regards,
> Tim.



-- 
Kornelis Sietsma  korny at my surname dot com
"Every jumbled pile of person has a thinking part
that wonders what the part that isn't thinking
isn't thinking of"

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To post to this group, send email to clojure@googlegroups.com
To unsubscribe from this group, send email to 
clojure+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---
