Awesome, Hay, thanks!

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Hay (Husky)
Sent: Wednesday, March 25, 2015 11:03
To: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Announce] New daily feed: media file request counts

Answering my own question: until somebody puts up a stats.grok.se-like interface for the mediacounts, I've hacked together a Python script that can be used to 'query' the TSV files with a file, or a list of files:

https://github.com/hay/wiki-tools/blob/master/etc/mediacounts-stats.py

-- Hay

On Wed, Mar 25, 2015 at 8:05 AM, Maarten Brinkerink <[email protected]> wrote:
> Dear Erik,
>
> Thanks for pointing to this nice development! Since I’m not so
> technical, I was wondering to what extent this development helps us
> reach the vision and requirements that have been described by Maarten
> Zeinstra as part of his research for the GW Toolset project
> (https://commons.wikimedia.org/wiki/Commons:GLAMwiki_Toolset_Project)?
>
> See:
> https://commons.wikimedia.org/wiki/File:Report_on_requirements_for_usage_and_reuse_statistics_for_GLAM_content.pdf
>
> Best,
>
> Maarten
>
> On 24 Mar 2015, at 20:47, Jane Darnell <[email protected]> wrote:
>
> +1 - I just crashed my spreadsheet trying to open one .tsv file. But
> great news indeed Erik - this is an important first step!
>
> On Tue, Mar 24, 2015 at 8:42 PM, Hay (Husky) <[email protected]> wrote:
>>
>> Awesome! I'm especially glad that more statistics than 'just' the
>> image views are included, like the aggregated views for thumbnails,
>> and the media files as well. I just hope somebody will build a tool
>> in the near future like stats.grok.se so we can view statistics for
>> individual files and/or sets of files a la BaGLAMa 2.
>>
>> -- Hay
>>
>> On Tue, Mar 24, 2015 at 6:39 PM, Erik Zachte <[email protected]>
>> wrote:
>> > Today WMF Analytics announces a new product: a daily feed of media
>> > file request counts for all Wikimedia projects [1].
>> >
>> > The counts are based on unsampled data, so any single request
>> > within the defined scope [2] will contribute to the counts.
>> >
>> > It can be seen as complementary to our page view counts files [5].
>> >
>> > The file layout is documented on wikitech [3].
>> >
>> > Daily counts have been backfilled from January 1, 2015 onwards.
>> >
>> > Additionally there is a daily zip file which contains a small
>> > subset of these raw counts: top 1000 most requested media files,
>> > one csv file for each column [7]. As these csv files have headers
>> > (not so easy to add in Hive) you may want to start with this file
>> > for a first impression (best opened in a spreadsheet program).
>> >
>> > The counts are collected from our Hadoop system, using a Hive
>> > query, with data markup done in UDF scripts. This feed hopefully
>> > addresses a long-standing request, expressed often and by many,
>> > which we regrettably couldn't fulfil earlier, as our pre-Hadoop
>> > infrastructure and processing capacity were not up to the task.
>> >
>> > An initial draft design (RFC) was presented last November at the
>> > Amsterdam Hackathon 2014 (GLAM and Wikidata).
>> >
>> > Online consultation followed, leading to the current design [4].
>> >
>> > This is a data feed with production status, but not the final
>> > release, as there is one major issue that hasn't been addressed yet
>> > (but progress is being made):
>> >
>> > When using Media Viewer to view images, some images are prefetched
>> > for better user experience, but these may never be shown to the user.
>> > Currently, those prefetched images are getting counted, as there is
>> > no way to detect whether an image was actually shown to the user or
>> > not.
>> >
>> > Gilles Dubuc and other colleagues worked on a solution that would
>> > not hamper performance (a tough challenge) and would help us
>> > discern viewed from non-viewed files. A few days ago a patch was
>> > published! Adaptation of the Hive query will follow later. [6]
>> > Also, and related, context tagging isn't supported yet.
>> > [9]
>> >
>> > Huge thanks to all people who contributed to the process so far,
>> > and still do.
>> >
>> > Special thanks to Christian Aistleitner with whom I co-authored the
>> > design, and who also wrote the Hive implementation.
>> >
>> > Erik Zachte
>> >
>> > [1] http://dumps.wikimedia.org/other/mediacounts/
>> >
>> > [2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Filtering
>> >
>> > [3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mediacounts
>> >
>> > [4] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
>> >
>> > [5] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
>> > (a new version of this data feed is in the works)
>> >
>> > [6] https://phabricator.wikimedia.org/T89088
>> >
>> > [7] Before you ask: no plans yet for further aggregation into
>> > monthly or yearly top ranking files. The current csv files are
>> > quick wins, using standard Linux tools.
>> >
>> > [8] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer
>> >
>> > [9] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#by_context
>> >
>> > _______________________________________________
>> > Analytics mailing list
>> > [email protected]
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
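
For readers who want to try the kind of lookup Hay describes in this thread (querying a mediacounts TSV for one file or a list of files), here is a minimal sketch. It is not Hay's script and does not reproduce it. It assumes only that each row is tab-separated and that the first column identifies the media file; the authoritative column layout is the one documented on wikitech [3]. The function name lookup_mediacounts and the sample rows are made up for illustration.

```python
import csv
import io


def lookup_mediacounts(lines, wanted):
    """Return {name: remaining_columns} for rows whose first column is in `wanted`.

    `lines` is any iterable of tab-separated text lines (for example an
    open file). Assumption: the first column identifies the media file.
    Rows are streamed one at a time, so the whole file is never held in
    memory.
    """
    wanted = set(wanted)
    found = {}
    # QUOTE_NONE: treat quote characters in file names as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row and row[0] in wanted:
            found[row[0]] = row[1:]
    return found


# Made-up sample rows for illustration; these are not real counts.
sample = io.StringIO(
    "/wikipedia/commons/a/aa/Example_A.jpg\t1000\t12\n"
    "/wikipedia/commons/b/bb/Example_B.png\t2000\t34\n"
)
result = lookup_mediacounts(sample, ["/wikipedia/commons/b/bb/Example_B.png"])
print(result)
```

Because the rows are read one by one rather than loaded all at once, this approach sidesteps the problem Jane mentions of a spreadsheet choking on a full daily file. The values are returned as strings exactly as they appear in the row; interpreting individual columns is left to the layout documentation.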
