Re: Standards for mail archive statistics gathering?

Steve Blackmon Thu, 07 May 2015 15:53:11 -0700

I'd like to submit to this group that no batch job is necessary to compute
many useful statistics. Rather, with a suitable representation, an indexed
event stream of mailing list messages, commits, releases, etc. can be
searched, aggregated, and visualized in real time.


I hacked a bit this afternoon: parsed much of the community-dev mbox history
using mime4j into activity streams json, and indexed it in elasticsearch with
kibana as the UI.
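
As a rough illustration of the parsing step described above, here is a minimal Python sketch using only the standard library. It is not the mime4j-based code from the repository. The field names `actor.displayName`, `published`, `content` and `summary` are taken from the Kibana column list in the link below; the `id`, `verb` and `actor.id` fields, the function name `message_to_activity`, and the sample message are assumptions made for illustration.

```python
import json
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def message_to_activity(raw_message: str) -> dict:
    """Map one RFC 822 message to an activity-streams-like dict (hypothetical layout)."""
    msg = message_from_string(raw_message)
    display_name, address = parseaddr(msg.get("From", ""))
    # Keep the sender's UTC offset so events from different zones sort correctly.
    published = parsedate_to_datetime(msg["Date"]).isoformat()
    # Only plain single-part bodies are handled in this sketch.
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "id": msg.get("Message-ID", ""),
        "verb": "post",  # assumed verb for a mailing list message
        "published": published,
        "actor": {"id": address, "displayName": display_name or address},
        "summary": msg.get("Subject", ""),
        "content": body,
    }


# Made-up sample message, not taken from the community-dev archive.
SAMPLE = (
    "From: Jane Doe <jane@example.org>\n"
    "Date: Thu, 07 May 2015 15:53:11 -0700\n"
    "Subject: Re: example thread\n"
    "Message-ID: <abc123@example.org>\n"
    "\n"
    "Hello list.\n"
)

activity = message_to_activity(SAMPLE)
print(json.dumps(activity, indent=2))
```

Each resulting dict is a flat JSON document, which is the shape a search index expects; multipart bodies, encoded headers and malformed dates would all need handling before this could be run over a real archive.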

Visit the link below for an idea of what an indexed activity streams
representation of a mailing list could look like.

http://72.182.111.65:5601/#/discover?_a=(columns:!(actor.displayName,published,content,summary),index:community-dev_activity,interval:auto,query:(query_string:(analyze_wildcard:!t,query:'*')),sort:!(published,asc))&_g=(time:(from:'2009-10-07T22:26:18.843Z',mode:absolute,to:'2015-05-07T22:26:18.843Z'))

Of course much more discussion and rigor would be required before something
like this could become official: determining appropriate
structure/identifier/format/enumeration of each field, adding robust error
handling, testing that no messages are lost in translation, resolving email
addresses back to apache LDAP ids, etc... but I wanted to show the
potential of this approach and what can be developed with minimal net new
code.
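
One of the checks listed above, testing that no messages are lost in translation, could be sketched along these lines in Python. This is only an illustration under assumptions: it presumes each converted activity carries the source Message-ID in an `id` field, and the function name `translation_report` and the sample ids are hypothetical.

```python
from collections import Counter


def translation_report(source_ids, activities):
    """Compare source Message-IDs against the ids of the converted activities."""
    source = set(source_ids)
    # Activities without an id are counted under the empty string.
    converted = Counter(a.get("id", "") for a in activities)
    return {
        # In the mbox, but never converted.
        "missing": sorted(source - set(converted)),
        # Converted, but not traceable to a source message.
        "unexpected": sorted(set(converted) - source),
        # Converted more than once.
        "duplicated": sorted(k for k, n in converted.items() if n > 1),
    }


# Made-up ids for illustration.
source_ids = ["<1@example.org>", "<2@example.org>", "<3@example.org>"]
activities = [
    {"id": "<1@example.org>"},
    {"id": "<3@example.org>"},
    {"id": "<3@example.org>"},
]

report = translation_report(source_ids, activities)
print(report)
```

A report with three empty lists would indicate a one-to-one translation, at least as far as Message-IDs can tell.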

All code used to build this has been pushed to
http://github.com/steveblackmon/streams-apache

Regards,

Steve Blackmon
sblack...@apache.org

On Wed, May 6, 2015 at 10:44 PM, Hervé BOUTEMY <herve.bout...@free.fr>
wrote:
> On Wednesday, 6 May 2015 12:48:34, Steve Blackmon wrote:
>> > For visualization, for sure, json is the current natural format when data
>> > is consumed from the browser.
>> > I don't have great experience on this, and what I'm missing with json
>> > currently is a common practice on documenting a structure: are there
>> > common
>> > practices?
>>
>> In podling streams [0], we make extensive use of json schema [1]
> thank you: that's exactly the initial info I was looking for: json schema!
>
>> from
>> which we generate POJOs with a maven
>> plugin jsonschema2pojo [2] which makes manipulating the objects in
>> Java/Scala pleasant.  I expect other languages have
>> similar jsonschema-based ORM paradigms as well.
> As usual for a Java developer, your tooling is interesting.
> But in the projects-new.a.o case, data extraction is coded in Python: if
> we create json schema, having Python classes generated could simplify coding.
> Anyone with Python+json schema experience around?
>
>
>> This pattern supports
>> inheritance both within
>> and across projects - for example see how [3] extends [4] which
>> extends [5].  These schemas are relatively self documenting,
>> but generating documentation or other artifacts is straight-forward as
>> they are themselves json documents.
> yeah, json schema document is easy to read (at least the examples on the
> site...)
>
>>
>> > Because for simple json structure, documentation is not really necessary,
>> > but once the structure goes complex, documentation is really a key
>> > requirement for people to use or extend. And I already see this
>> > shortcoming with the 11 json files from projects-new.a.o =
>> > https://projects-new.apache.org/json/foundation/
>> Having used these json documents a few weeks ago to build an apache
>> community visualization [6]
> yeah, really nice visualization!
>
>> IMO the current crop of projects-new jsons
>> are intermediate artifacts rather than a sufficiently cross-purpose
>> data model, a role currently held by DOAP, mbox, and misc others, all
>> with some inherent shortcomings, most notably lack of navigability
>> between silos.
> +1
> I'm at a point where I start to really understand the concepts involved and
> want to code a simple data model: I'll report here once I have a first version
> available.
>
>> I'd like to nominate activity streams [7] with
>> community-specific extensions (such as those roughly prototyped here:
>> [8] ) as a potential core data model for this effort going forward
> I had a first look at it: it is more complex than what I had in mind
> We'll have to share and see what's the best bet
>
>> and
>> I'm happy to help apply some of the useful tools and connectors within
>> podling streams toward that end. Converting external structured
>> sources into normalized documents and indexing those activities to
>> power data-centric APIs and visualizations are wheelhouse use cases
>> for this project, as they say.
> Great, stay tuned: I'll probably work on it this week-end
>
> Regards,
>
> Hervé
>
>>
>> [0] http://streams.incubator.apache.org/
>> [1] http://json-schema.org/documentation.html
>> [2] http://www.jsonschema2pojo.org/
>> [3] https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema/objectTypes/committee.json
>> [4] https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/objectTypes/group.json
>> [5] https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/object.json
>> [6] http://72.182.111.65:3000/workspace/3
>> [7] http://activitystrea.ms/
>> [8] https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema
>>
>> Steve Blackmon
>> sblack...@apache.org
>>
>> On Wed, May 6, 2015 at 2:05 AM, Hervé BOUTEMY <herve.bout...@free.fr> wrote:
>> > On Tuesday, 5 May 2015 21:26:36, Shane Curcuru wrote:
>> >> On 5/5/15 7:33 AM, Boris Baldassari wrote:
>> >> > Hi Folks,
>> >> >
>> >> > Sorry for the late answer on this thread. Don't know what has been done
>> >> > since then, but I've some experience to share on this, so here are my
>> >> > 2c..
>> >>
>> >> No, more input is always appreciated!  Hervé is doing some
>> >> centralization of the projects-new.a.o data capture, which is related
>> >> but slightly separate.
>> >
>> > +1
>> > this can give a common place to put code once experiments show that we
>> > should add a new data source
>> >
>> >> But this is going to be a long-term project
>> >
>> > +1
>> >
>> >> with
>> >> plenty of different people helping I bet.
>> >
>> > I hope so...
>> >
>> >> ...
>> >>
>> >> > * Parsing mboxes for software repository data mining:
>> >> > There is a suite of tools exactly targeted at this kind of duty on
>> >> > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I
>> >> > don't know how they manage time zones, but the toolsuite is widely used
>> >> > around (see [3] or [4] as examples) so I believe they are quite robust.
>> >> > It includes tools for data retrieval as well as visualisation.
>> >>
>> >> Drat.  Metrics Grimoire looks pretty nifty - essentially a set of
>> >> frameworks for extracting metadata from a bunch of sources - but it's
>> >> GPL, so personally I have no interest in working on it.  If someone else
>> >> uses it to generate datasets that's great.
>> >>
>> >> > * As for the feedback/thoughts about the architecture and formats:
>> >> > I love the REST-API idea proposed by Rob. That's really easy to access
>> >> > and retrieve through scripts on-demand. CSV and JSON are my favourite
>> >> > formats, because they are, again, easy to parse and widely used --
>> >> > every language and library has some facility to read them natively.
>> >>
>> >> Yup - again, like project visualization, to make any of this simple for
>> >> newcomers to try stuff, we need to separate data gathering / model /
>> >> visualization.  Since most of these are spare time projects, having easy
>> >> chunks makes it simpler for different people to try their hand at it.
>> >
>> > For visualization, for sure, json is the current natural format when data
>> > is consumed from the browser.
>> > I don't have great experience on this, and what I'm missing with json
>> > currently is a common practice on documenting a structure: are there
>> > common
>> > practices?
>> > Because for simple json structure, documentation is not really necessary,
>> > but once the structure goes complex, documentation is really a key
>> > requirement for people to use or extend. And I already see this
>> > shortcoming with the 11 json files from projects-new.a.o =
>> > https://projects-new.apache.org/json/foundation/
>> >
>> > Regards,
>> >
>> > Hervé
>> >
>> >> Thanks,
>> >>
>> >> - Shane
>
