Re: Standards for mail archive statistics gathering?

Shane Curcuru Tue, 28 Apr 2015 06:18:02 -0700

On 4/27/15 2:29 PM, Rob Weir wrote:
> On Mon, Apr 27, 2015 at 9:36 AM, Shane Curcuru <a...@shanecurcuru.org> wrote:
...
> If you do Python, you might take a look at
> https://svn.apache.org/repos/asf/openoffice/devtools/list-stats/ for a
> simple program that could be adapted easily enough.   It uses the
> Python mailbox library to do the parsing.


ACK, will take a look.  Yes, I started with a Python library, but my
issue is finding a chunk of time to start, code, and actually finish any
one piece, so having a starting place is what I need.
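
For a sense of how small such a starting place can be, here is a minimal
sketch (not the list-stats program itself) that uses the standard
library's mailbox module to count posts per sender address.  The function
name and the mbox path passed to it are hypothetical.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def count_posts_by_sender(mbox_path):
    """Return a Counter mapping lowercased sender address -> message count."""
    counts = Counter()
    # create=False: raise instead of silently creating an empty mbox file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # parseaddr splits "Name <addr>" into (name, addr);
            # addr is an empty string when the header cannot be parsed.
            _name, address = parseaddr(str(message.get("From", "")))
            if address:
                counts[address.lower()] += 1
    finally:
        box.close()
    return counts
```

Calling count_posts_by_sender("dev.mbox").most_common(10) on a downloaded
archive would then list the ten most frequent sender addresses, before any
alias merging.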

> 
> The biggest challenge making sense of such data, for me at least, was
> the multiple email addresses a single person can use.   Determining
> these aliases for a project you are involved in is possible, though
> tedious.   Doing it for an unfamiliar project borders on the
> impossible.

Yes - a huge part of the value is in identity tracking.  Many committer
records now have alternate emails filled in, in the LDAP data behind
id.apache.org, and Members can work with infra to get access, so we
certainly can do this for most Apache lists.
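
As a rough sketch of the matching step, assuming the alternate emails have
already been exported into a plain address-to-id table (the ALIASES dict
and its entries below are invented for illustration, not real LDAP data
or a real LDAP query):

```python
from email.utils import parseaddr

# Hypothetical alias table: every known address maps to one canonical id.
ALIASES = {
    "jdoe@apache.org": "jdoe",
    "jane.doe@example.com": "jdoe",
}


def canonical_identity(from_header, aliases=ALIASES):
    """Map a From: header to a canonical id, falling back to the address."""
    _name, address = parseaddr(from_header)
    address = address.lower()
    # Unknown addresses are kept as-is so they still show up in the stats.
    return aliases.get(address, address)
```

Falling back to the bare address means an unfamiliar project still
produces usable numbers; they are just split across aliases until someone
fills in the table.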

> Another "fun" problem is getting all the post time data into the same
> UTC timezone.   The mbox format does not seem to enforce a consistent
> way of encoding these.

Ah, good point.  I was going to start cheap and simply categorize by
calendar day, and call it good enough.
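
Even the calendar-day version needs one agreed timezone, or a post sent
late in the evening in one zone lands on a different day than its reply.
A minimal sketch using the standard library's email.utils, assuming the
Date: headers are roughly RFC 2822 and treating a missing offset as UTC
(both are assumptions; the function name is hypothetical):

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def utc_post_day(date_header):
    """Return the UTC calendar day for a Date: header, or None if unparsable."""
    try:
        when = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Older Pythons raise TypeError, newer ones ValueError, on bad input.
        return None
    if when is None:
        return None
    if when.tzinfo is None:
        # No usable offset in the header: assume UTC (an assumption).
        when = when.replace(tzinfo=timezone.utc)
    return when.astimezone(timezone.utc).date()
```

Headers that fail to parse come back as None, so they can be counted
separately rather than silently dropped.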

> 
> I see I have a few other analysis scripts on my hard drive I haven't
> checked in that handle the TZ and other issues.   I'll get those
> checked in.   It seems that, almost as good as pre-extracted data
> would be an easy API.
> 
> 
> Ever think of having a contest related to "Visualizing Apache"?   I
> was considering proposing something like that for OpenOffice.
> Provide the data for download (already extracted from our transaction
> systems, so we don't get a harmful amount of load on those servers) and
> invite the community to do the analysis, see what insights they can
> generate.

Yes, that's exactly why I want to treat this as an actual architecture,
so to speak.  Really separate out data finding from parsing from
identity matching, and then just find some interim format that
visualization people could just look at.  That makes it much simpler for
a volunteer or someone with limited time to accomplish a real task,
since they only have to focus on one bit.
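
One way that separation might look, as a sketch only: each stage is its
own generator, and the interim format is one JSON object per line.  The
stage names, the record fields, and the choice of JSON lines are all
assumptions for illustration; nothing here is an agreed format.

```python
import json
from email.utils import parseaddr


def parse_stage(raw_messages):
    """Parsing: reduce raw header dicts to the few fields the stats need."""
    for raw in raw_messages:
        _name, address = parseaddr(raw.get("From", ""))
        yield {"sender": address.lower(), "date": raw.get("Date", "")}


def identity_stage(records, aliases):
    """Identity matching: add a 'person' field from a hypothetical alias table."""
    for record in records:
        yield dict(record, person=aliases.get(record["sender"], record["sender"]))


def emit_stage(records):
    """Interim format: one JSON object per line for visualization tools."""
    for record in records:
        yield json.dumps(record, sort_keys=True)
```

Chaining them as emit_stage(identity_stage(parse_stage(msgs), aliases))
gives lines that someone working only on visualization can read without
touching mbox files or the identity data.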

- Shane

> 
> Regards,
> 
> -Rob
> 
> 
>> Thanks,
>> - Shane
