Thanks for the overview and all the good work this year! --Mike
On Sat, Dec 3, 2016 at 11:38 AM Dan Andreescu <[email protected]> wrote:

> We're starting to wrap up the calendar year; here's what we've
> accomplished so far with Wikistats. We're really excited to have some data
> in our production Hive database for people to play with. We worked really
> hard to clean up and present an intuitive interface to all of MediaWiki
> history. The results are captured in the tables mentioned below, which
> we'll cover more in an upcoming tech talk. Documentation for the project is
> here <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>.
>
> Our goals so far and progress breakdown:
>
> 1. [done] Build pipeline to process and analyze *pageview* data
> 2. [done] Load pageview data into an *API*
> 3. [    ] *Sanitize* pageview data with more dimensions for public
>           consumption
> 4. [beta] Build pipeline to process and analyze *editing* data
> 5. [beta] Load editing data into an *API*
> 6. [    ] *Sanitize* editing data for public consumption
> 7. [    ] *Design* UI to organize dashboards built around new data
> 8. [    ] Build enough *dashboards* to replace the main functionality
>           of stats.wikipedia.org
> 9. [    ] Officially replace stats.wikipedia.org with *(maybe)
>           analytics.wikipedia.org <http://analytics.wikipedia.org/>*
> *. [    ] Bonus: *replace dumps generation* based on the new data
>           pipelines
>
> 4 & 5. Since our last update, we've finished the pipeline that imports
> data from MediaWiki databases, cleans it up as well as possible, reshapes
> it in an analytics-friendly way, and makes it easily queryable. I'm marking
> these goals as "beta" because we're still tweaking the algorithm for
> performance and productionizing the jobs. This will be completed early
> next quarter, but in the meantime we have data for people to play with
> internally. Sadly, we haven't sanitized it yet, so we can't publish it. For
> those with internal access:
>
> * https://pivot.wikimedia.org/#edit-history-test is the full history
>   across all wikis.
>   It's a bit hard to understand how to slice and dice, so we will host a
>   tech talk and present it at the January metrics meeting if we can.
>
> * In Hive, you can access this data in the wmf database; the tables are:
>   - wmf.mediawiki_history: denormalized full history with this schema
>     <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_history>
>   - wmf.mediawiki_page_history: the sequence of states of each wiki page
>     (schema
>     <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_page_history>)
>   - wmf.mediawiki_user_history: the sequence of states of each user
>     account (schema
>     <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_user_history>)
>
> 6. Sanitizing has not moved forward, as we need DBA time and they've been
> overloaded. We will attempt to restart this effort in Q3.
>
> 7. We have begun the design process; we'll share more about this as we go.
>
> Our goals and planning for next quarter have us finishing 4, 5, 7, and 8 --
> basically putting a UI on top of the data pipeline we have in place and
> updating it weekly. We also hope to make good progress on 6, but that
> depends on collaboration with the DBA team and is harder than we
> originally imagined.
>
> And remember, voice your opinions about important reports in the current
> Wikistats here:
> https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report
> (thank you so, so much to the many people who already chimed in).
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
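[Editor's note: for readers with internal access, a minimal sketch of the kind of HiveQL query the tables above support. It assumes the field names documented in the linked mediawiki_history schema (event_entity, event_type, event_timestamp, wiki_db); check the schema page before running, as the table is still in beta and fields may change.]

```sql
-- Hypothetical example: top 10 wikis by revisions created in 2016,
-- using the denormalized full-history table.
SELECT wiki_db,
       COUNT(*) AS revisions_created
FROM   wmf.mediawiki_history
WHERE  event_entity = 'revision'
  AND  event_type = 'create'
  AND  substr(event_timestamp, 1, 4) = '2016'
GROUP BY wiki_db
ORDER BY revisions_created DESC
LIMIT 10;
```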
