I'd like to submit to this group that no batch job is necessary to compute many useful statistics. Rather, with a suitable representation, an indexed event stream of mailing list messages, commits, releases, etc. can be searched, aggregated, and visualized in real time.
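As a rough illustration of what such a representation could look like, here is a minimal sketch that maps one email onto an activity-streams-style json document and computes a simple per-month count over the resulting stream. It uses Python's standard library rather than mime4j. The field names actor.displayName, published, content and summary are the columns shown in the Kibana view linked in this message; the sample message, the "post" verb, and the rest of the layout are assumptions for illustration only.

```python
import email
import json
from collections import Counter
from email.utils import parseaddr, parsedate_to_datetime

# Made-up sample message; the addresses and ids are placeholders.
RAW_MESSAGE = """\
From: Example Sender <sender@example.org>
To: dev@community.example.org
Subject: Example thread
Date: Wed, 06 May 2015 22:44:00 -0500
Message-ID: <abc123@example.org>

Body of the message.
"""


def message_to_activity(msg):
    """Map one parsed email onto an activity-streams-style dict.

    The layout is an assumption, not the schema used by the prototype.
    """
    name, address = parseaddr(msg["From"])
    published = parsedate_to_datetime(msg["Date"])
    return {
        "id": msg["Message-ID"],
        "verb": "post",  # assumed verb for "sent a message to the list"
        "published": published.isoformat(),
        "actor": {"objectType": "person", "id": address, "displayName": name},
        "summary": msg["Subject"],
        # get_payload() returns a string only for non-multipart messages
        "content": msg.get_payload(),
    }


activity = message_to_activity(email.message_from_string(RAW_MESSAGE))
print(json.dumps(activity, indent=2))

# Once every event has the same shape, statistics are plain aggregations
# over the stream, here the number of messages per calendar month.
per_month = Counter(a["published"][:7] for a in [activity])
print(per_month)
```

For a real archive, the standard library's mailbox.mbox yields messages that could be passed to the same function, although multipart bodies and malformed headers would need extra handling.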
Hacked a bit this afternoon - parsed much of the community-dev mbox history using mime4j into activity streams json, indexed in elasticsearch with kibana as UI.

Visit the link below for an idea of what an indexed activity streams representation of a mailing list could look like.

http://72.182.111.65:5601/#/discover?_a=(columns:!(actor.displayName,published,content,summary),index:community-dev_activity,interval:auto,query:(query_string:(analyze_wildcard:!t,query:'*')),sort:!(published,asc))&_g=(time:(from:'2009-10-07T22:26:18.843Z',mode:absolute,to:'2015-05-07T22:26:18.843Z'))

Of course much more discussion and rigor would be required before something like this could become official: determining appropriate structure/identifier/format/enumeration of each field, adding robust error handling, testing that no messages are lost in translation, resolving email addresses back to apache LDAP ids, etc... but I wanted to show the potential of this approach and what can be developed with minimal net new code.

All code used to build this has been pushed to http://github.com/steveblackmon/streams-apache

Regards,
Steve Blackmon
sblack...@apache.org

On Wed, May 6, 2015 at 10:44 PM, Hervé BOUTEMY <herve.bout...@free.fr> wrote:
> On Wednesday 6 May 2015 12:48:34, Steve Blackmon wrote:
>> > For visualization, for sure, json is the current natural format when data
>> > is consumed from the browser.
>> > I don't have great experience on this, and what I'm missing with json
>> > currently is a common practice on documenting a structure: are there
>> > common practices?
>>
>> In podling streams [0], we make extensive use of json schema [1]
> thank you: that's exactly the initial info I was looking for: json schema!
>
>> from which we generate POJOs with a maven
>> plugin jsonschema2pojo [2] which makes manipulating the objects in
>> Java/Scala pleasant. I expect other languages have
>> similar jsonschema-based ORM paradigms as well.
> As usual for a Java developer, your tooling is interesting
> But in the projects-new.a.o case, data extraction is coded in Python: if
> we create json schema, having Python classes generated could simplify coding.
> Anyone with Python+json schema experience around?
>
>> This pattern supports inheritance both within
>> and across projects - for example see how [3] extends [4] which
>> extends [5]. These schemas are relatively self documenting,
>> but generating documentation or other artifacts is straight-forward as
>> they are themselves json documents.
> yeah, json schema document is easy to read (at least the examples on the
> site...)
>
>> > Because for simple json structure, documentation is not really necessary,
>> > but once the structure goes complex, documentation is really a key
>> > requirement for people to use or extend. And I already see this
>> > shortcoming with the 11 json files from projects-new.a.o =
>> > https://projects-new.apache.org/json/foundation/
>> Having used these json documents a few weeks ago to build an apache
>> community visualization [6]
> yeah, really nice visualization!
>
>> IMO the current crop of project-new jsons
>> are intermediate artifacts rather than a sufficiently cross-purpose
>> data model, a role currently held by DOAP mbox and misc others all
>> with some inherent shortcomings most notably lack of navigability
>> between silos.
> +1
> I'm at a point where I start to really understand the concepts involved and
> want to code a simple data model: I'll report here once I have a first version
> available.
>
>> I'd like to nominate activity streams [7] with
>> community-specific extensions (such as those roughly prototyped here:
>> [8] ) as a potential core data model for this effort going forward
> I had a first look at it: it is more complex than what I had in mind
> We'll have to share and see what's the best bet
>
>> and I'm happy to help apply some of the useful tools and connectors within
>> podling streams toward that end. Converting external structured
>> sources into normalized documents and indexing those activities to
>> power data-centric APIs and visualizations are wheelhouse use cases
>> for this project, as they say.
> Great, stay tuned: I'll probably work on it this week-end
>
> Regards,
>
> Hervé
>
>>
>> [0] http://streams.incubator.apache.org/
>> [1] http://json-schema.org/documentation.html
>> [2] http://www.jsonschema2pojo.org/
>> [3] https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema/objectTypes/committee.json
>> [4] https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/objectTypes/group.json
>> [5] https://github.com/apache/incubator-streams/blob/master/streams-pojo/src/main/jsonschema/object.json
>> [6] http://72.182.111.65:3000/workspace/3
>> [7] http://activitystrea.ms/
>> [8] https://github.com/steveblackmon/streams-apache/blob/master/activities/src/main/jsonschema
>>
>> Steve Blackmon
>> sblack...@apache.org
>>
>> On Wed, May 6, 2015 at 2:05 AM, Hervé BOUTEMY <herve.bout...@free.fr> wrote:
>> > On Tuesday 5 May 2015 21:26:36, Shane Curcuru wrote:
>> >> On 5/5/15 7:33 AM, Boris Baldassari wrote:
>> >> > Hi Folks,
>> >> >
>> >> > Sorry for the late answer on this thread. Don't know what has been done
>> >> > since then, but I've some experience to share on this, so here are my
>> >> > 2c..
>> >>
>> >> No, more input is always appreciated!
>> >> Hervé is doing some
>> >> centralization of the projects-new.a.o data capture, which is related
>> >> but slightly separate.
>> >
>> > +1
>> > this can give a common place to put code once experiments show that we
>> > should add a new data source
>> >
>> >> But this is going to be a long-term project
>> >
>> > +1
>> >
>> >> with plenty of different people helping I bet.
>> >
>> > I hope so...
>> >
>> >> ...
>> >>
>> >> > * Parsing mboxes for software repository data mining:
>> >> > There is a suite of tools exactly targeted at this kind of duty on
>> >> > github: Metrics Grimoire [1], developed (and used) by Bitergia [2]. I
>> >> > don't know how they manage time zones, but the toolsuite is widely used
>> >> > around (see [3] or [4] as examples) so I believe they are quite robust.
>> >> > It includes tools for data retrieval as well as visualisation.
>> >>
>> >> Drat. Metrics Grimoire looks pretty nifty - essentially a set of
>> >> frameworks for extracting metadata from a bunch of sources - but it's
>> >> GPL, so personally I have no interest in working on it. If someone else
>> >> uses it to generate datasets that's great.
>> >>
>> >> > * As for the feedback/thoughts about the architecture and formats:
>> >> > I love the REST-API idea proposed by Rob. That's really easy to access
>> >> > and retrieve through scripts on-demand. CSV and JSON are my favourite
>> >> > formats, because they are, again, easy to parse and widely used --
>> >> > every language and library has some facility to read them natively.
>> >>
>> >> Yup - again, like project visualization, to make any of this simple for
>> >> newcomers to try stuff, we need to separate data gathering / model /
>> >> visualization. Since most of these are spare time projects, having easy
>> >> chunks makes it simpler for different people to try their hand at it.
>> >
>> > For visualization, for sure, json is the current natural format when data
>> > is consumed from the browser.
>> > I don't have great experience on this, and what I'm missing with json
>> > currently is a common practice on documenting a structure: are there
>> > common practices?
>> > Because for simple json structure, documentation is not really necessary,
>> > but once the structure goes complex, documentation is really a key
>> > requirement for people to use or extend. And I already see this
>> > shortcoming with the 11 json files from projects-new.a.o =
>> > https://projects-new.apache.org/json/foundation/
>> >
>> > Regards,
>> >
>> > Hervé
>> >
>> >> Thanks,
>> >>
>> >> - Shane
>
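
On the Python + json schema question raised in the quoted thread, here is a minimal sketch of how a Python class could be derived from a json schema document using only the standard library. It is a hand-rolled illustration, not jsonschema2pojo or any existing generator: the schema below is hypothetical (not one of the streams or projects-new.a.o schemas), only flat "properties" and "required" are handled, and nothing is validated.

```python
import json
from dataclasses import asdict, field, make_dataclass

# Hypothetical schema for illustration only.
SCHEMA = json.loads("""
{
  "title": "Committee",
  "type": "object",
  "properties": {
    "id": {"type": "string"},
    "displayName": {"type": "string"},
    "memberCount": {"type": "integer"}
  },
  "required": ["id"]
}
""")

# Rough mapping from json schema primitive types to Python types.
JSON_TO_PYTHON = {
    "string": str,
    "integer": int,
    "number": float,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def class_from_schema(schema):
    """Build a dataclass whose fields mirror the schema's properties."""
    required = set(schema.get("required", []))
    # Required fields go first: dataclass fields without defaults
    # must precede fields that have defaults.
    ordered = sorted(schema["properties"].items(),
                     key=lambda item: item[0] not in required)
    fields = []
    for name, spec in ordered:
        py_type = JSON_TO_PYTHON.get(spec.get("type"), object)
        if name in required:
            fields.append((name, py_type))
        else:
            fields.append((name, py_type, field(default=None)))
    return make_dataclass(schema["title"], fields)


Committee = class_from_schema(SCHEMA)
committee = Committee(id="example-pmc", displayName="Example PMC")
print(asdict(committee))
```

Schema inheritance of the kind described for [3], [4] and [5] is not covered here and would need references between schemas to be resolved first.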