On Thu, 14 Apr 2005, Paul Jackson wrote:

To me, rename is a special case of the more general case of a
big chunk of code (a portion of a file) that was in one place
either being moved or copied to another place.

I wonder if there might be some way to use the tools that biologists use
to analyze DNA sequences, to track the evolution of source code,
identifying things like common chunks of code that differ in just a few
mutations, and presenting the history of the evolution, at selectable
levels of detail.

The rsync algorithm (http://samba.anu.edu.au/rsync/tech_report/node2.html) is probably a good place to start, although it is relatively sensitive to mutations: it finds only exact matches, so a single changed byte inside a block defeats the match for that block. It can efficiently detect identical blocks larger than some block size N (512 bytes or so for rsync); you might well consider smaller blocks to be irrelevant. The data can be made considerably more useful to developers by canonicalizing before searching (i.e., compressing runs of whitespace to ' ', etc.)[*]. Note that the identical regions do *not* have to line up on block boundaries; see the rsync algorithm for more detail.
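As a rough illustration of the idea, here is a minimal Python sketch of rsync-style block matching on canonicalized text. The function names (`canonicalize`, `weak_sum`, `roll`, `find_shared_blocks`) and the 16-byte default block size are assumptions made for this example, not part of rsync or git. Unlike real rsync, the sketch does not skip ahead after a match; it simply reports every offset in the new data where a block of the old data reappears.

```python
import hashlib
import re

M = 1 << 16  # both halves of the weak checksum are kept modulo 2**16


def canonicalize(text: str) -> bytes:
    """Collapse every run of whitespace to a single space."""
    return re.sub(r"\s+", " ", text).encode("utf-8")


def weak_sum(block: bytes):
    """rsync-style weak checksum of a block, as the pair (a, b)."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M  # uses the updated a
    return a, b


def find_shared_blocks(old: bytes, new: bytes, n: int = 16):
    """Return (new_offset, old_offset) pairs where an n-byte block of
    `old` (taken on block boundaries) occurs anywhere in `new`."""
    index = {}
    for off in range(0, len(old) - n + 1, n):
        block = old[off:off + n]
        entry = (hashlib.sha1(block).digest(), off)
        index.setdefault(weak_sum(block), []).append(entry)

    matches = []
    if len(new) < n:
        return matches
    a, b = weak_sum(new[:n])
    pos = 0
    while True:
        # Cheap weak-checksum lookup first; confirm with a strong hash.
        for strong, old_off in index.get((a, b), ()):
            if hashlib.sha1(new[pos:pos + n]).digest() == strong:
                matches.append((pos, old_off))
                break
        if pos + n >= len(new):
            break
        a, b = roll(a, b, new[pos], new[pos + n], n)
        pos += 1
    return matches
```

Only the old data is cut on block boundaries; the window over the new data advances one byte at a time, which is why matching regions need not be aligned.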

I think Linus has made a persuasive case that the 'developer-friendly' features of an SCM (i.e., annotate, log, and friends) can be built *on top* of GIT. This is a perfect example. Since the computation is non-trivial (although linear in the number of lines of code involved in the history of a file; i.e., it doesn't depend on the size of the rest of the archive), it might make sense for the front-end SCM to maintain its own caches: for example, of the block and rolling checksums for each file required by the rsync algorithm. The key point is that these are just *caches*, not essential history information, and can always be wiped and regenerated.
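One way such a disposable cache could look is sketched below. Everything here is an assumption for illustration: the `SignatureCache` class, its on-disk JSON layout, the use of `zlib.adler32` as a stand-in for rsync's weak checksum, and the cache key (a plain SHA-1 of the file contents, which is not the same as a git object ID).

```python
import hashlib
import json
import shutil
import zlib
from pathlib import Path


class SignatureCache:
    """Hypothetical front-end cache of per-block checksums.

    Entries are derived purely from file contents, so the whole
    directory can be deleted at any time and rebuilt on demand.
    """

    def __init__(self, root, block_size=512):
        self.root = Path(root)
        self.block_size = block_size
        self.root.mkdir(parents=True, exist_ok=True)

    def _compute(self, data: bytes):
        n = self.block_size
        # One [weak, strong] pair per block; adler32 stands in for
        # rsync's rolling checksum here.
        return [
            [zlib.adler32(data[i:i + n]), hashlib.sha1(data[i:i + n]).hexdigest()]
            for i in range(0, len(data), n)
        ]

    def signatures(self, data: bytes):
        """Return block signatures, computing and storing them on a miss."""
        key = hashlib.sha1(data).hexdigest()
        path = self.root / f"{key}-{self.block_size}.json"
        if path.exists():
            return json.loads(path.read_text())
        sigs = self._compute(data)
        path.write_text(json.dumps(sigs))
        return sigs

    def wipe(self):
        """Throw the cache away; nothing of historical value is lost."""
        shutil.rmtree(self.root, ignore_errors=True)
        self.root.mkdir(parents=True, exist_ok=True)
```

Because the key depends only on content and block size, a wiped cache regenerates to exactly the same entries.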

The nice 'feature' of this system (some may disagree, I guess) is that it does *not* depend on extensive programmer annotation of file changes (i.e., chunk A in file B came from lines C-D of file E, or file F was once named G, etc.). Because it infers history from content-similar files and blocks, it should be better able to generate useful results after importing third-party sources, which may come in distinct 'releases' but lack explicit history annotations.

[*] In general, I will be *glad* to see source management move away from CVS's line-oriented style; there's no good reason we should still be worrying about whitespace changes, etc. When we build 'developer-friendly' tools we should make every effort to auto-detect source code, image formats, etc., and automatically perform appropriate canonicalization and beautification of diffs, because this can be/should be/is entirely separate from git's underlying storage representation.
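A small sketch of what whitespace-insensitive diffing in a front-end tool might look like, using Python's standard `difflib`. The helper names `canon` and `changed_lines` are made up for this example; lines are compared after whitespace canonicalization, but the original text is what gets displayed.

```python
import difflib


def canon(line: str) -> str:
    """Canonical form of a line: whitespace runs collapsed, ends trimmed."""
    return " ".join(line.split())


def changed_lines(old: str, new: str):
    """Return '-'/'+' prefixed lines that differ once whitespace is ignored."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    matcher = difflib.SequenceMatcher(
        a=[canon(line) for line in old_lines],
        b=[canon(line) for line in new_lines],
        autojunk=False,
    )
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Show the original (uncanonicalized) text of each changed line.
        out.extend("-" + line for line in old_lines[i1:i2])
        out.extend("+" + line for line in new_lines[j1:j2])
    return out
```

A pure re-indentation then produces an empty result, while a real insertion still shows up; the stored file contents are never modified.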

( http://cscott.net/ )