As mentioned before, most file systems can deliver callbacks when
files or directories change. On Java SE 6, using this ability requires
resorting to native code (JNA or JNI; see
http://jnotify.sourceforge.net/). In SE 7 this will be covered by
NIO 2 (http://java.sun.com/developer/technicalArticles/javase/nio/).
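For anyone curious what the NIO 2 version looks like, here is a rough
sketch against the WatchService API described in that article (the
class and method names around it are my own, and the API details may
still shift before SE 7 ships):

```java
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;

// Sketch of the NIO 2 WatchService API: register a directory, block
// until something changes under it, and return the events that fired.
public class WatchExample {
    public static List<String> awaitChanges(Path dir) throws Exception {
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        WatchKey key = watcher.take();        // blocks until an event arrives
        List<String> kinds = new ArrayList<String>();
        for (WatchEvent<?> event : key.pollEvents()) {
            kinds.add(event.kind() + ": " + event.context());
        }
        key.reset();                          // re-arm the key for further events
        return kinds;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(awaitChanges(Paths.get(args.length > 0 ? args[0] : ".")));
    }
}
```

No native code needed, which is the whole point compared to the
JNotify route on SE 6.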

I would look into rsync; as far as I know it implements a very
efficient algorithm for comparing folders. There is a Java
implementation at http://jarsync.sourceforge.net/

On May 29, 4:09 pm, Korny Sietsma <ko...@sietsma.com> wrote:
> I keep reading this thread when I should be going to bed :)  Sadly,
> this stuff is in what I call my "0.2% time" so I'm not working very
> hard on it right now.
>
> The ruby code, which basically works (but is rather ugly in parts)
> relies on reading the whole file tree into memory, and then traversing
> the tree, saving each node (file or dir) in a repository, which
> internally detects duplicates as they are added.
>
> The repository stores all known nodes, indexed by size, and then
> grouped into clumps of identical nodes.  When you add a new node, it
> compares it to all clumps of the same size as the node, looking for an
> identical clump; if it finds one, the new node is added to the clump,
> otherwise it forms a new clump.
> (sorry if this is a bit of a vague description, I haven't worked on
> this code for a while so all the details are a bit vague)
>
> Once the repository is built, it's pretty easy to throw away all
> non-duplicate nodes, and report on the duplicates.
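Just to make that concrete, here is roughly how I'd sketch the
size-index-then-clump step in Java for plain files (the names are
mine, directories are ignored, and MD5 stands in for whatever
byte-level comparison the real code does):

```java
import java.nio.file.*;
import java.security.MessageDigest;
import java.util.*;

// Sketch of the "clump" idea: index files by size, then split each
// size bucket into clumps of byte-identical files (compared by hash).
public class ClumpFinder {
    public static Map<String, List<Path>> findClumps(List<Path> files) throws Exception {
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path p : files) {
            bySize.computeIfAbsent(Files.size(p), k -> new ArrayList<>()).add(p);
        }
        Map<String, List<Path>> clumps = new HashMap<>();
        for (List<Path> sameSize : bySize.values()) {
            if (sameSize.size() < 2) continue;       // unique size => no duplicate
            for (Path p : sameSize) {
                clumps.computeIfAbsent(hash(p), k -> new ArrayList<>()).add(p);
            }
        }
        clumps.values().removeIf(c -> c.size() < 2); // keep only real duplicates
        return clumps;
    }

    static String hash(Path p) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(p));
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

Grouping by size first means most files never get read at all; only
size collisions get hashed.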
>
> This works, but has some issues:
> - it's got some kind-of ugly handling for some special cases, like
> making sure a directory doesn't match its own children if it only has
> one child
> - it's a bit slow, and uses a lot of memory
> - it *only* handles exact matches.
>
> I've been playing with shingling and sketching algorithms that are
> used by search engines to identify nearly-identical documents, and I
> think they could be applied to this problem; in fact I suspect they
> could speed it up considerably.  (If you want to know more about this
> the best reference online seems to be the book "Introduction to
> Information Retrieval" which is at
> http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
> - the chapter most relevant is at
> http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-...
> )
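The shingling idea is small enough to sketch in a toy form (my own
simplification of what that chapter describes; a real implementation
would compress the shingle sets into min-hash sketches rather than
comparing them directly):

```java
import java.util.*;

// Toy sketch of near-duplicate detection via w-shingling: break each
// document into overlapping word w-grams and compare the shingle sets
// with the Jaccard coefficient (1.0 = identical, 0.0 = disjoint).
public class Shingles {
    public static Set<String> shingles(String text, int w) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + w <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + w)));
        }
        return out;
    }

    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }
}
```

For directories you'd shingle over file hashes or names rather than
words, but the comparison step is the same.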
>
> But, like I said, this is my 0.2% time project, so it might be some
> time before I really do more than think about this. :)
>
> - Korny
>
> On Fri, May 29, 2009 at 4:51 PM, Daniel Lyons <fus...@storytotell.org> wrote:
>
> > For whatever reason I just can't seem to put this problem down.
>
> > I have rewritten the code substantially. A major bottleneck was using
> > Java's MD5 classes. The "Fast MD5" library really is, and that helped
> > a lot. I did get the -> notation to work and I have a reasonable HOF
> > now for doing the winnowing, which might even be applicable to another
> > program someday, maybe.
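On the MD5 front, it may be worth ruling out I/O before crediting the
digest itself; the stock java.security classes can stream a file
through a large buffer without ever loading it whole. A sketch:

```java
import java.io.*;
import java.security.DigestInputStream;
import java.security.MessageDigest;

// Streaming MD5 with the stock java.security API: the digest is
// updated as bytes pass through the stream, so memory use stays flat
// regardless of file size.
public class Md5 {
    public static String md5(File f) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(
                new BufferedInputStream(new FileInputStream(f), 1 << 16), md)) {
            byte[] buf = new byte[1 << 16];
            while (in.read(buf) != -1) { /* digest updated as we read */ }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

If this is still much slower than the "Fast MD5" library, then the
bottleneck really is the digest implementation and not the reads.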
>
> > Anyway I uploaded it here:
> > <http://clojure.googlegroups.com/web/dupfinder.clj> and again I'd
> > love any feedback anyone cares to give.
>
> > Just to add insult to injury, I went ahead and programmed it again in
> > Ruby. The good news is that I can't seem to get Ruby to find all the
> > files the Clojure one finds, but the bad news is that the Ruby version
> > is like four times faster. I'd love to understand that. So I uploaded
> > that too: <http://clojure.googlegroups.com/web/dupfinder.rb>. Of
> > course it must be benefiting to some extent from the fact that the
> > Ruby version has a lot less abstraction, but I don't see how the
> > approach is fundamentally any different or why there would be such a
> > large performance disparity. I must be missing something big.
>
> > On May 28, 2009, at 6:50 AM, Korny Sietsma wrote:
> >> By the way, in response to whoever suggested pre-sorting files; I
> >> sort-of do this (in the old ruby version) but actually, mostly the
> >> program is looking for duplicate *directories* of files - the goal is
> >> to point it at my archive disk, and have it find the biggest identical
> >> subdirectories.  Duplicate file checking is needed for this, but it's
> >> only a tiny part.
>
> >> And I'm playing with sketching algorithms at work right now, which
> >> look very handy for the next phase, which is to find the biggest
> >> *similar* subdirectories.  That's the real goal - point a program at a
> >> terabyte archive disk, and have it spit out:
> >> "/archive/old_disks/laptop_2007a is 312gb and 99% similar to
> >> /archive/misc/stuff_from_2007"
> >> ... or sorting by file count:
> >> "/archive/source/old_projects/c_stuff/1996 is 20,324 files and 97%
> >> similar to /archive/old/disks/laptop2006/unsorted/old_drives/
> >> old_archive/c_cpp_stuff/90s"
>
> > I can think of three ways to approach this, none of which are
> > particularly easy.
>
> > The first is to take the duplicate file finding function and look for
> > common suffixes of paths. It could almost be like running a fuzzy
> > duplicate finder against your duplicates. I suspect the performance
> > here will blow for no particular reason I can put my finger on.
>
> > The second approach would be to use something like rsync's rolling
> > checksum. I bet you could use a rather stupid heuristic to find
> > candidate directories for similarity and then apply this algorithm
> > amongst the candidates. It would be handy if you could have a version
> > of diff that would look at two directories recursively and give up if
> > they were sufficiently different.
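The weak half of rsync's checksum is simple enough to sketch from the
published description; the point is that sliding the window one byte
is O(1), so every offset of a big file can be scanned cheaply:

```java
// Sketch of rsync's weak rolling checksum: s1 is the byte sum over the
// window, s2 the sum of the running s1 values, both kept mod 2^16 and
// packed into one int. Sliding the window updates both in O(1).
public class RollingChecksum {
    public static int weak(byte[] block) {
        int s1 = 0, s2 = 0;
        for (int i = 0; i < block.length; i++) {
            s1 = (s1 + (block[i] & 0xff)) & 0xffff;
            s2 = (s2 + s1) & 0xffff;
        }
        return (s2 << 16) | s1;
    }

    // Slide the window one byte: drop `out`, append `in`, window length n.
    public static int roll(int checksum, int n, byte out, byte in) {
        int s1 = checksum & 0xffff, s2 = checksum >>> 16;
        s1 = (s1 - (out & 0xff) + (in & 0xff)) & 0xffff;
        s2 = (s2 - n * (out & 0xff) + s1) & 0xffff;
        return (s2 << 16) | s1;
    }
}
```

A weak match is then confirmed with a strong hash (MD5 or similar)
before a block is declared identical, since the weak sum collides
easily.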
>
> > Another option would be to ignore the filesystem altogether and look
> > for runs of similar blocks on the device directly. For some reason I
> > think this will perform better but it'll probably be harder to reason
> > about during development and I doubt it will be a cakewalk from Java.
>
> > In any event, that's a very interesting problem, but for some reason
> > it's setting off my alarm bells. I think if it's going to be useful
> > it'll probably wind up being exponential time, and that compromises to
> > make it run acceptably will probably wind up diminishing the value. I
> > have no idea why I think these things though... maybe I'm just up too
> > late.
>
> > Another point: most OSes and filesystems have a change notification
> > API now (or an emulation of one) which you could use to maintain
> > indices, if you want to do something more than once. This probably
> > isn't relevant to this problem though. It's on my mind because I was
> > just recently hacking on something for a friend that uses Mac OS X's
> > API to do that.
>
> > Hope any of this helps and looking forward to (all) your feedback, :)
>
> > —
> > Daniel Lyons
> > http://www.storytotell.org -- Tell It!
>
> --
> Kornelis Sietsma  korny at my surname dot com
> "Every jumbled pile of person has a thinking part
> that wonders what the part that isn't thinking
> isn't thinking of"
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
-~----------~----~----~----~------~----~------~--~---
