My only thought is that "city" makes me uncomfortable. Did we track down a precise use case for that in the end?
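
To make the worry concrete, here is a rough Python sketch of what a k threshold does once city is added as a dimension. Everything in it is an assumption for illustration: the sample rows are made up, `aggregate` and `publishable` are hypothetical helpers, K = 3 is a placeholder (no k has been chosen yet), and it counts views rather than distinct agents, which a real k-anonymity check would presumably need.

```python
from collections import Counter

# Hypothetical sample rows, one per pageview:
# (project, page_title, country_code, province, city). Values are made up.
VIEWS = [
    ("en.wikipedia", "Influenza", "US", "California", "San Francisco"),
    ("en.wikipedia", "Influenza", "US", "California", "San Francisco"),
    ("en.wikipedia", "Influenza", "US", "California", "San Francisco"),
    ("en.wikipedia", "Influenza", "US", "California", "Fresno"),
    ("en.wikipedia", "Influenza", "US", "Oregon", "Portland"),
    ("en.wikipedia", "Influenza", "US", "Oregon", "Portland"),
]

# Dimension names as listed for the Geo Cube in the quoted message.
DIMENSIONS = ("project", "page_title", "country_code", "province", "city")


def aggregate(rows, dims):
    """Count rows per combination of the requested dimensions."""
    idx = [DIMENSIONS.index(d) for d in dims]
    return Counter(tuple(row[i] for i in idx) for row in rows)


def publishable(cells, k):
    """Keep only cells whose count is at least k (assumed suppression rule)."""
    return {key: n for key, n in cells.items() if n >= k}


K = 3  # placeholder threshold, not a recommendation

by_province = publishable(aggregate(VIEWS, DIMENSIONS[:4]), K)
by_city = publishable(aggregate(VIEWS, DIMENSIONS), K)

print("views kept at province level:", sum(by_province.values()))
print("views kept at city level:", sum(by_city.values()))
```

In this toy data the finer the dimension, the more cells fall under the threshold and get dropped, so city costs coverage as well as raising the privacy question.
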
On 5 June 2015 at 09:25, Dan Andreescu <[email protected]> wrote:
> I just posted a comment on the famous task:
> https://phabricator.wikimedia.org/T44259#1341010 :)
>
> Here it is for those who would rather discuss on this list:
>
>
> We have finished analyzing the intermediate hourly aggregate with all the
> columns that we think are interesting. The data is too large to query and
> anonymize in real time. We'd rather get an API out faster than deal with
> that problem, so we decided to produce smaller "cubes" [1] of data for
> specific purposes. We have two cubes in mind and I'll explain those here.
> For each cube, we're aiming to have:
>
> * Direct access to a postgresql database in labs with the data
> * API access through RESTBase
> * Mondrian / Saiku access in labs for dimensional analysis
> * Data will be pre-aggregated so that any single data point has k-anonymity
> (we have not determined a good k yet)
> * Higher level aggregations will be pre-computed so they use all data
>
> And, the cubes are:
>
> **stats.grok.se Cube: basic pageview data**
>
> Hourly resolution. Will serve the same purpose as stats.grok.se has served
> for so many years. The dimensions available will be:
>
> * project - 'Project name from requests host name'
> * dialect - 'Dialect from requests path (not set if present in project
> name)'
> * page_title - 'Page Title from requests path and query'
> * access_method - 'Method used to access the pages, can be desktop, mobile
> web, or mobile app'
> * is_zero - 'accessed through a zero provider'
> * agent_type - 'Agent accessing the pages, can be spider or user'
> * referer_class - 'Can be internal, external or unknown'
>
>
> **Geo Cube: geo-coded pageview data**
>
> Daily resolution. Will allow researchers to track the flu, breaking news,
> etc. Dimensions will be:
>
> * project - 'Project name from requests hostname'
> * page_title - 'Page Title from requests path and query'
> * country_code - 'Country ISO code of the accessing agents (computed using
> MaxMind GeoIP database)'
> * province - 'State / Province of the accessing agents (computed using
> MaxMind GeoIP database)'
> * city - 'Metro area of the accessing agents (computed using MaxMind GeoIP
> database)'
>
>
> So, if anyone wants another cube, **now** is the time to speak up. We'll
> probably add cubes later, but it may be a while.
>
> [1] OLAP cubes: https://en.wikipedia.org/wiki/OLAP_cube
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
