On Fri, 15 Apr 2005, C. Scott Ananian wrote:
> I think examining the rsync algorithms should convince you that finding 
> common chunks can be fairly efficient.

Note that "efficient" really depends on how good a job you want to do, so 
you can tune it to how much CPU you can afford to waste on the problem.

For example, my earlier example merged five different functions into one
function, and it is truly pretty efficient to find
things like that _IF_ we only look at the files that changed (since the
set of files that change in any one particular commit tends to be small,
relative to the whole repository). There are many good algorithms for 
finding "common code", and with modern hardware that is basically 
instantaneous if you look at a few tens of files.
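As a rough illustration of what "finding common code" can look like at that scale — this is only a sketch of line-window ("shingle") hashing in the spirit of rsync's block checksums, not any particular tool, and all the names are made up:

```python
import hashlib

def line_shingles(text, k=4):
    """Hash every window of k consecutive lines; similar code shares shingles."""
    lines = [l.strip() for l in text.splitlines()]
    shingles = set()
    for i in range(len(lines) - k + 1):
        chunk = "\n".join(lines[i:i + k]).encode()
        shingles.add(hashlib.sha1(chunk).hexdigest())
    return shingles

def similarity(a, b, k=4):
    """Jaccard similarity of two files' shingle sets (1.0 = identical windows)."""
    sa, sb = line_shingles(a, k), line_shingles(b, k)
    if not (sa or sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Running this pairwise over the handful of files touched by a commit is trivially fast on modern hardware, which is the point above: the changed set is small even when the repository is huge.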

For example, people wrote efficient things to compare _millions_ of lines 
of code for the whole SCO saga - you can do quite well. Some googling
comes up with examples of such tools, and applying them to a smallish
set of files is quite efficient.

What is _not_ necessarily as easy is the situation where you notice that a 
new set of lines appeared, but you don't see any place that matches that 
set of lines in the set of CHANGED files. That's actually quite common, i.e.
let's say that you have a new filesystem or a new driver, and almost 
always it's based on a template or something, and you _would_ be able to 
see where that template came from, except it's not in that _changed_ set.

And that is still doable, but now you really have to compare against the
whole tree if you want to do it. Even _that_ is actually efficient if you
cache the hashes - that's how the comparison tools compare two totally
independent trees against each other, and it makes it practically possible
to do even that expensive O(n**2) operation in reasonable time. It's
certainly possible to do exactly the same thing for the "new code got
added, does it bear any similarity to old code" case.
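A rough sketch of that whole-tree version, using the same line-window hashing idea (function names are purely illustrative): build the hash index over the tree once, and then each newly added chunk becomes a single lookup instead of a pairwise file comparison.

```python
import hashlib
import os

def index_tree(root, k=4):
    """Map every k-line chunk hash to the set of files that contain it."""
    index = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="replace") as f:
                    lines = f.read().splitlines()
            except OSError:
                continue
            for i in range(len(lines) - k + 1):
                chunk = "\n".join(lines[i:i + k]).encode()
                index.setdefault(hashlib.sha1(chunk).hexdigest(), set()).add(path)
    return index

def find_sources(new_text, index, k=4):
    """Which existing files share chunks with newly added text? Most hits first."""
    hits = {}
    lines = new_text.splitlines()
    for i in range(len(lines) - k + 1):
        h = hashlib.sha1("\n".join(lines[i:i + k]).encode()).hexdigest()
        for path in index.get(h, ()):
            hits[path] = hits.get(path, 0) + 1
    return sorted(hits.items(), key=lambda kv: -kv[1])
```

The expensive part is building the index, and that only has to happen when the tree changes — which is exactly the caching argument above.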

Note! This is a question that is relevant and actually is in the realm of
the "possible to find the answer interactively". It may be fairly expensive,
but the point is that this is the kind of relevant question that really 
does depend on the fundamental notion that "data matters more than any 
local changes". And when you think about the problem in that form, you 
find these kinds of interesting questions that you _can_ answer.

Because of the way git identifies data, the example "is there any other
relevant code that may actually be similar to the newly added code" is
actually not that hard to do in git. Remember: the way to answer that
question is to have a cache of hashes of the contents. Guess what git
_is_? You can now index your line-based hashes of contents against the
_object_ hashes that git keeps track of, and you suddenly have an
efficient way to actually look up those hashes.
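A minimal sketch of that coupling — the blob-hash formula below is the one git actually computes (SHA-1 over "blob <size>\0" plus the contents), while the cache around it is purely illustrative:

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    """The object ID git assigns a blob: sha1 of 'blob <len>\\0' + contents."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

# Hypothetical cache: line-window hashes keyed by git object ID, so a blob
# that git already knows about never gets re-scanned.
_cache = {}

def line_hashes(data: bytes, k=4):
    oid = git_blob_hash(data)
    if oid not in _cache:
        lines = data.decode(errors="replace").splitlines()
        _cache[oid] = {
            hashlib.sha1("\n".join(lines[i:i + k]).encode()).hexdigest()
            for i in range(len(lines) - k + 1)
        }
    return _cache[oid]
```

Because identical content always gets the same object ID, the cache stays valid across commits, branches, and even unrelated repositories — which is what makes the lookup efficient in practice.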

NOTE! All of this is outside the scope of git itself. This is all
"visualization and comparison tools" built up on top of git. And I'm not
at all interested in writing those tools myself, and I'm absolutely not
signing up for that part. All I'm arguing for is that the git architecture
is actually a very good architecture for doing these kinds of very very
cool tools.

To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
