My only thought is that "city" makes me uncomfortable. Did we track
down a precise use case for that in the end?

On 5 June 2015 at 09:25, Dan Andreescu <[email protected]> wrote:
> I just posted a comment on the famous task:
> https://phabricator.wikimedia.org/T44259#1341010 :)
>
> Here it is for those who would rather discuss on this list:
>
>
> We have finished analyzing the intermediate hourly aggregate with all the
> columns that we think are interesting.  The data is too large to query and
> anonymize in real time.  We'd rather get an API out faster than deal with
> that problem, so we decided to produce smaller "cubes" [1] of data for
> specific purposes.  We have two cubes in mind and I'll explain those here.
> For each cube, we're aiming to have:
>
> * Direct access to a postgresql database in labs with the data
> * API access through RESTBase
> * Mondrian / Saiku access in labs for dimensional analysis
> * Data will be pre-aggregated so that any single data point has k-anonymity
> (we have not determined a good k yet)
> * Higher level aggregations will be pre-computed so they use all data
>
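
For anyone following along, the two pre-aggregation bullets above can be sketched as below. This is a minimal illustration, not the planned implementation: the function name `build_cube`, the sample rows, and the threshold `k=2` are all hypothetical (no k has been chosen), and thresholding on raw view counts is a simplification, since a real k-anonymity check would more likely count distinct agents.

```python
from collections import Counter


def build_cube(requests, dimensions, k):
    """Count requests per dimension combination and withhold small cells.

    Returns (published, total): `published` keeps only cells with at
    least k requests; `total` is computed from every cell, including the
    withheld ones, so the higher-level aggregate uses all data.
    """
    cells = Counter(tuple(r[d] for d in dimensions) for r in requests)
    published = {key: n for key, n in cells.items() if n >= k}
    total = sum(cells.values())
    return published, total


# Hypothetical sample rows using the Geo Cube dimension names.
sample = (
    [{"project": "en.wikipedia", "page_title": "Influenza", "country_code": "US"}] * 3
    + [{"project": "en.wikipedia", "page_title": "Influenza", "country_code": "RO"}]
)

published, total = build_cube(
    sample, ("project", "page_title", "country_code"), k=2
)
print(published)
print(total)
```

The single "RO" row falls below the threshold and is withheld from the published cells, but it still contributes to the total.
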
> And, the cubes are:
>
> **stats.grok.se Cube: basic pageview data**
>
> Hourly resolution.  Will serve the same purpose as stats.grok.se has served
> for so many years.  The dimensions available will be:
>
> * project - 'Project name from requests hostname'
> * dialect - 'Dialect from requests path (not set if present in project
> name)'
> * page_title - 'Page Title from requests path and query'
> * access_method - 'Method used to access the pages, can be desktop, mobile
> web, or mobile app'
> * is_zero - 'accessed through a zero provider'
> * agent_type - 'Agent accessing the pages, can be spider or user'
> * referer_class - 'Can be internal, external or unknown'
>
>
> **Geo Cube: geo-coded pageview data**
>
> Daily resolution.  Will allow researchers to track the flu, breaking news,
> etc.  Dimensions will be:
>
> * project - 'Project name from requests hostname'
> * page_title - 'Page Title from requests path and query'
> * country_code - 'Country ISO code of the accessing agents (computed using
> MaxMind GeoIP database)'
> * province - 'State / Province of the accessing agents (computed using
> MaxMind GeoIP database)'
> * city - 'Metro area of the accessing agents (computed using MaxMind GeoIP
> database)'
>
>
> So, if anyone wants another cube, **now** is the time to speak up.  We'll
> probably add cubes later, but it may be a while.
>
> [1] OLAP cubes: https://en.wikipedia.org/wiki/OLAP_cube
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
