----- Original Message -----
> From: "Justin Lebar" <justin.le...@gmail.com>
> To: "Benjamin Smedberg" <benja...@smedbergs.us>
> Cc: "Benoit Jacob" <jacob.benoi...@gmail.com>, "Josh Aas" 
> <josh...@gmail.com>, dev-platform@lists.mozilla.org
> Sent: Thursday, February 28, 2013 9:14:52 AM
> Subject: Re: improving access to telemetry data
> 
> It sounds to me like people want both
> 
> 1) Easier access to aggregated data so they can build their own
> dashboards roughly comparable in features to the current dashboards.
> 
> 2) Easier access to raw databases so that people can build up more
> complex analyses, either by exporting the raw data from the db, or by
> analyzing it in the db.
> 
> That is, I don't think we can or should export JSON with all the data
> in our databases.  That is a lot of data.

I've used telemetry data a little bit for finding information about add-on
usage. It took me a while to figure out how to use Pig and run Hadoop jobs, and
it would be great to have something easier to use. Based on what little I know,
it seems like a lot of queries fit the following scheme:

1. Filter based on version and/or buildid as well as the product 
(Firefox/TB/Fennec).
2. Select a random sample of x% of all pings.
3. Dump out the JSON and process it in Python or via some other external tool.
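
As a rough sketch of the three steps above in Python: this assumes the pings
arrive as one JSON object per line, and the field names (`info`, `appName`,
`appVersion`, `appBuildID`) and the `sample_pings` helper are illustrative
guesses, not a description of how the telemetry store is actually laid out.

```python
import json
import random


def sample_pings(lines, product, version=None, buildid=None,
                 percent=1.0, seed=None):
    """Yield a random percent% of the pings that match the filters.

    `lines` is any iterable of JSON strings, one ping per string.
    The field names used below are assumptions made for illustration.
    """
    rng = random.Random(seed)
    for line in lines:
        ping = json.loads(line)
        info = ping.get("info", {})
        # Step 1: filter on product, and on version/buildid if given.
        if info.get("appName") != product:
            continue
        if version is not None and info.get("appVersion") != version:
            continue
        if buildid is not None and info.get("appBuildID") != buildid:
            continue
        # Step 2: keep each matching ping with probability percent/100.
        if rng.random() * 100.0 < percent:
            # Step 3: hand the decoded JSON to the caller for processing.
            yield ping
```

Each ping is kept or dropped independently, so the input never has to fit in
memory, but the size of the output is only approximately x% of the matches.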

This, at least, was sufficient for what I was doing. It sounds like it would
also work for many of the applications people have suggested so far, although
I might be misunderstanding.

It sounds like we might be able to come up with a few generic queries that 
could run each day. One could be for Nightly data with yesterday's buildid and 
another could be for recent Aurora submissions, etc. The data would be randomly 
sampled to generate a compressed JSON file of some reasonable size (maybe 
100MB) that would then be uploaded to an FTP server that everyone could access. 
The old files would be thrown away after a few weeks, although we could archive 
a few in case someone wants older data.
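
To make the daily dump idea a bit more concrete, here is a minimal sketch that
uses reservoir sampling to keep a fixed number of pings no matter how many
came in that day, and writes them out as gzip-compressed JSON, one ping per
line. Capping by ping count rather than by bytes is a simplification on my
part, and the function names and file layout are hypothetical.

```python
import gzip
import json
import random


def reservoir_sample(items, k, seed=None):
    """Return a uniform random sample of at most k items from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


def write_daily_dump(pings, path, k, seed=None):
    """Write a sample of at most k pings to a gzip-compressed JSON-lines file."""
    sample = reservoir_sample(pings, k, seed)
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for ping in sample:
            out.write(json.dumps(ping) + "\n")
    return len(sample)
```

Picking k from an average ping size would get the file close to the target
size; the resulting file could then be uploaded by whatever job runs the query.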

I'm sure that this wouldn't cover every single use case of telemetry, but it
could be used both for dashboards and to get the raw data. The random sampling
seems like the biggest potential problem, although you could compare data
across a few days to see how significant the results are. At the very least,
this data would make it easy to try out prototypes. Once you find something
that works, you could create a customized query that is more specific to the
application.

-Bill
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
