On 4/27/15 2:29 PM, Rob Weir wrote:
> On Mon, Apr 27, 2015 at 9:36 AM, Shane Curcuru <a...@shanecurcuru.org> wrote:
...
> If you do Python, you might take a look at
> https://svn.apache.org/repos/asf/openoffice/devtools/list-stats/ for a
> simple program that could be adapted easily enough.  It uses the
> Python mailbox library to do the parsing.

ACK, will look at.  Yes, I started with a Python library, but my issue
is finding a chunk of time to start, code, and actually finish any one
piece, so having a starting place is what I need.

>
> The biggest challenge making sense of such data, for me at least, was
> the multiple email addresses a single person can use.  Determining
> these aliases for a project you are involved in is possible, though
> tedious.  Doing it for an unfamiliar project borders on the
> impossible.

Yes - a huge part of the value is in identity tracking.  Many committer
records now have alternate emails filled in in the LDAP data that is
behind id.apache.org, and Members certainly can work with infra to get
access, so we certainly can do this for most Apache lists.

> Another "fun" problem is getting all the post time data into the same
> UTC timezone.  The mbox format does not seem to enforce a consistent
> way of encoding these.

Ah, good point.  I was going to start cheap and simply categorize by
calendar day, and call it good enough.

>
> I see I have a few other analysis scripts on my hard drive I haven't
> checked in that handle the TZ and other issues.  I'll get those
> checked in.  It seems that, almost as good as pre-extracted data
> would be an easy API.
>
>
> Ever think of having a contest related to "Visualizing Apache"?  I
> was considering proposing something like that for OpenOffice.
> Provide the data for download (already extracted from our transaction
> systems, so we don't get a harmful amount of load on those servers) and
> invite the community to do the analysis, see what insights they can
> generate.

Yes, that's exactly why I want to treat this as an actual architecture,
so to speak.  Really separate out data finding from parsing from
identity matching, and then just find some interim format that
visualization people could just look at.  That makes it much simpler
for a volunteer, or anyone with limited time, to accomplish a real
task, since they only have to focus on one piece.

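To make that split concrete, here is a rough sketch, not a finished
design, of the three pieces as separate Python functions: date
normalization to UTC, identity matching, and a per-day count as the
interim format.  The alias table, the function names, and the mbox path
are all made up for illustration; a real alias table would presumably
come from the LDAP data mentioned above.  It only assumes the standard
library (mailbox, email.utils).

```python
import mailbox
from collections import Counter
from datetime import timezone
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical alias table: every known address maps to one canonical id.
ALIASES = {
    "alice@example.org": "alice",
    "alice@example.com": "alice",
}


def canonical_sender(from_header, aliases=ALIASES):
    """Identity matching: map a From: header to a canonical id.

    Unknown addresses fall back to the lower-cased bare address.
    """
    _, addr = parseaddr(from_header or "")
    addr = addr.lower()
    return aliases.get(addr, addr)


def to_utc(date_header):
    """Parsing: turn a Date: header into an aware UTC datetime.

    Returns None when the header is missing or cannot be parsed.
    """
    if not date_header:
        return None
    try:
        parsed = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        # A "-0000" offset yields a naive datetime; treat it as UTC.
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def posts_per_day(messages, aliases=ALIASES):
    """Interim format: Counter keyed by (UTC calendar day, canonical id).

    `messages` is any iterable of objects with a .get(header) method,
    such as a mailbox.mbox or plain dicts.
    """
    counts = Counter()
    for msg in messages:
        when = to_utc(msg.get("Date"))
        if when is None:
            continue  # skip posts whose date cannot be trusted
        key = (when.date().isoformat(), canonical_sender(msg.get("From"), aliases))
        counts[key] += 1
    return counts


def count_mbox(path, aliases=ALIASES):
    """Data finding: read one local mbox file (path is an assumption)."""
    return posts_per_day(mailbox.mbox(path, create=False), aliases)
```

Because the day is taken after the conversion to UTC, a post sent at
23:30 -0400 on the 27th is counted on the 28th, which is the kind of
shift that a naive bucket by local calendar day would miss.
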
- Shane

> Regards,
>
> -Rob
>
>> Thanks,
>> - Shane