----- Original Message -----
> From: "Justin Lebar" <justin.le...@gmail.com>
> To: "Benjamin Smedberg" <benja...@smedbergs.us>
> Cc: "Benoit Jacob" <jacob.benoi...@gmail.com>, "Josh Aas"
> <josh...@gmail.com>, dev-platform@lists.mozilla.org
> Sent: Thursday, February 28, 2013 9:14:52 AM
> Subject: Re: improving access to telemetry data
>
> It sounds to me like people want both
>
> 1) Easier access to aggregated data so they can build their own
> dashboards roughly comparable in features to the current dashboards.
>
> 2) Easier access to raw databases so that people can build up more
> complex analyses, either by exporting the raw data from the db, or by
> analyzing it in the db.
>
> That is, I don't think we can or should export JSON with all the data
> in our databases. That is a lot of data.

I've used telemetry data a little bit for finding information about
addon usage. It took me a while to figure out how to use Pig and run
Hadoop jobs, and it would be great to have something easier to use.
Based on what little I know, it seems like a lot of queries fit the
following scheme:

1. Filter based on version and/or buildid as well as the product
   (Firefox/TB/Fennec).
2. Select a random sample of x% of all pings.
3. Dump out the JSON and process it in Python or via some other
   external tool.

This, at least, was sufficient for what I was doing. It sounds like it
would also work for many of the applications people have suggested so
far, although I might be misunderstanding.

It sounds like we might be able to come up with a few generic queries
that could run each day. One could be for Nightly data with yesterday's
buildid, another could be for recent Aurora submissions, etc. The data
would be randomly sampled to generate a compressed JSON file of some
reasonable size (maybe 100MB) that would then be uploaded to an FTP
server that everyone could access. The old files would be thrown away
after a few weeks, although we could archive a few in case someone
wants older data.

I'm sure that this wouldn't cover every single use case of telemetry.
However, it could be used both for dashboards and to get the raw data.
The random sampling seems like the biggest potential problem, but you
could compare data across a few days to see how significant the results
are. At the very least, this data would make it easy to try out
prototypes. Once you find something that works, you could create a more
customized query that would be more specific to the application.

-Bill
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
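
For illustration only, the three-step scheme in the message above
(filter, sample, dump JSON) could be sketched in Python roughly as
follows. This is not an existing telemetry tool: the function names, the
"info" key, and the "appName"/"appVersion"/"appBuildID" field names are
assumptions made for the sketch, and pings are assumed to arrive as
already-parsed dicts.

```python
import gzip
import json
import random


def matches(ping, product=None, version=None, buildid=None):
    """Step 1: keep a ping only if it matches every filter that was given."""
    info = ping.get("info", {})  # "info" and its keys are assumed field names
    wanted = {"appName": product, "appVersion": version, "appBuildID": buildid}
    return all(v is None or info.get(k) == v for k, v in wanted.items())


def sample_pings(pings, percent, seed=None, **filters):
    """Steps 1 and 2: filter, then keep each ping with probability percent/100."""
    rng = random.Random(seed)  # pass a seed to make a sample reproducible
    for ping in pings:
        if matches(ping, **filters) and rng.random() * 100 < percent:
            yield ping


def dump_sample(pings, path):
    """Step 3: write one JSON object per line to a gzip-compressed file."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for ping in pings:
            out.write(json.dumps(ping) + "\n")
            count += 1
    return count  # number of pings written
```

A daily job along the lines proposed above might then call something
like dump_sample(sample_pings(pings, 1, product="Firefox",
buildid=yesterday), "nightly.json.gz"), where pings and yesterday are
placeholders for whatever the real data source provides. Sampling each
ping independently keeps memory use flat, at the cost of the output size
only being approximately x% of the input.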