On Tue, 2015-09-01 at 11:30 -0400, Eric S. Raymond wrote:
> Joseph Myers <jos...@codesourcery.com>:
> > With 227369 revisions I don't think adding git-style summary lines is 
> > really practical without some very reliable automation to match commits to 
> > corresponding gcc-patches messages (whose Subject: headers would be the 
> > natural choice for such summary lines)....
> 
> In this case you may be right.  Select =L tells me there are 101139
> commits wanting that sort of adjustment, which I think is at least
> 2.5x the bulk I've ever had to deal with before.
> 
> Still, if anyone else is brave enough to write a script that will munch
> through gcc-patches producing committer/date/subject-line triples, I'll
> give it a try.

I don't think committer/date/subject-line triples are adequate: the
dates are unlikely to match up, for one thing.

I think such a solution would need to somehow locate and match patches
themselves.

I was feeling brave, so I had a go at writing a scraper; see:
https://github.com/davidmalcolm/patch-finder
for what I have so far (tested with Python 2.7).

This can scrape the gcc-patches archives and locate mails containing
patches, extracting the patches (some of them anyway...).  The idea
would be to stuff the patches into some kind of big data store, and
then somehow try to match them against commits (perhaps within a rough
date "window").
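
To make the matching idea concrete, here is a rough sketch. It is not
code from patch-finder; the names patch_fingerprint and find_candidates,
the 30-day default window, and the (url, date, diff_text) tuple layout
are all assumptions made up for illustration, and it is written for
Python 3. It compares only the added/removed lines of each diff, on the
guess that context lines and headers will often differ between the
mailed patch and what was eventually committed.

```python
import hashlib
from datetime import datetime, timedelta


def patch_fingerprint(diff_text):
    """Hash only the added/removed lines of a unified diff.

    File headers, hunk headers and context lines are ignored, and
    surrounding whitespace on each changed line is stripped. This is a
    heuristic: a removed line whose content starts with "--" would be
    mistaken for a file header and skipped.
    """
    changed = []
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not content
        if line.startswith(("+", "-")):
            changed.append(line[0] + line[1:].strip())
    return hashlib.sha1("\n".join(changed).encode("utf-8")).hexdigest()


def find_candidates(commit_date, commit_diff, mailed, window_days=30):
    """Return URLs of mailed patches that match a commit.

    `mailed` is an iterable of (url, date, diff_text) tuples; this
    layout is hypothetical. Only mails dated within `window_days` of
    the commit are considered.
    """
    want = patch_fingerprint(commit_diff)
    window = timedelta(days=window_days)
    return [url for url, date, diff in mailed
            if abs(commit_date - date) <= window
            and patch_fingerprint(diff) == want]
```

An exact fingerprint match will miss patches that were revised before
being committed, so a real tool would probably need some fuzzier
similarity measure as a fallback.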

Does this seem like a viable approach?

Caution: this script performs numerous URL GETs on gcc.gnu.org;
it caches everything, but the first time you run it, the cache
will be cold.  (So please be careful!)
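
For anyone writing something similar, the caching can be as simple as
the sketch below. This is an illustration only, not the code in
patch-finder: the function name cached_get, the on-disk layout (one
file per URL, named by the SHA-1 of the URL), and the one-second delay
before each real download are assumptions, and it targets Python 3
rather than 2.7.

```python
import hashlib
import os
import time
import urllib.request


def _download(url):
    """Fetch a URL over the network and return the raw bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def cached_get(url, cache_dir="cache", fetch=_download, delay=1.0):
    """Return the body for `url`, hitting the network only on a cache miss.

    `fetch` is injectable so the function can be exercised without
    network access; `delay` throttles real downloads to go easy on
    the server while the cache is cold.
    """
    os.makedirs(cache_dir, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    time.sleep(delay)  # pause before each uncached request
    data = fetch(url)
    with open(path, "wb") as f:
        f.write(data)
    return data
```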

> About scale:  The largest repository I've dealt with before this was
> NetBSD, with a working set of 18GB, vs 45GB for this one.  The way
> reposurgeon's internal representations work, working set is dominated
> by comment text.  So the GCC repo has about 2.5x the comment bulk of
> NetBSD.

