Awesome, Hay, thanks!

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Hay (Husky)
Sent: Wednesday, March 25, 2015 11:03
To: A mailing list for the Analytics Team at WMF and everybody who has an 
interest in Wikipedia and analytics.
Subject: Re: [Analytics] [Announce] New daily feed: media file request counts

Answering my own question: until somebody puts up a stats.grok.se-like 
interface for the mediacounts, I've hacked together a Python script that can be 
used to 'query' the TSV files for a single file or a list of files:

https://github.com/hay/wiki-tools/blob/master/etc/mediacounts-stats.py
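
For a rough idea of what such a query looks like, here is a minimal sketch. It is not the script linked above. It assumes that each TSV row starts with the file's base name and that the remaining columns hold the counts; check the layout documented on wikitech before relying on it. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def query_mediacounts(tsv_lines, wanted_names):
    """Return {base_name: remaining_columns} for rows whose first column matches.

    Assumption: column 1 is the file's base name and the other columns are
    counts. The rows are streamed one at a time, so the whole file never
    has to fit in memory.
    """
    wanted = set(wanted_names)
    hits = {}
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row and row[0] in wanted:
            hits[row[0]] = row[1:]
    return hits


# Hypothetical sample rows, not real counts.
sample = io.StringIO(
    "/wikipedia/commons/a/a1/Example_one.jpg\t1000\t12\n"
    "/wikipedia/commons/b/b2/Example_two.png\t2500\t7\n"
)
result = query_mediacounts(sample, ["/wikipedia/commons/b/b2/Example_two.png"])
print(result)
```

To run this against a real daily file, pass an open text-mode file handle instead of the sample. If the file is bz2-compressed (an assumption here), bz2.open(path, "rt", encoding="utf-8") from the standard library returns such a handle.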

-- Hay

On Wed, Mar 25, 2015 at 8:05 AM, Maarten Brinkerink 
<[email protected]> wrote:
> Dear Erik,
>
> Thanks for pointing to this nice development! Since I’m not so 
> technical, I was wondering to what extent this development helps us 
> reach the vision and requirements that have been described by Maarten 
> Zeinstra as part of his research for the GW Toolset project 
> (https://commons.wikimedia.org/wiki/Commons:GLAMwiki_Toolset_Project)?
>
> See:
> https://commons.wikimedia.org/wiki/File:Report_on_requirements_for_usage_and_reuse_statistics_for_GLAM_content.pdf
>
> Best,
>
> Maarten
>
> On 24 Mar 2015, at 20:47, Jane Darnell <[email protected]> wrote:
>
> +1 - I just crashed my spreadsheet trying to open one .tsv file. But 
> great news indeed Erik - this is an important first step!
>
> On Tue, Mar 24, 2015 at 8:42 PM, Hay (Husky) <[email protected]> wrote:
>>
>> Awesome! I'm especially glad that more statistics than 'just' the 
>> image views are included, like the aggregated views for thumbnails, 
>> and the media files as well. I just hope somebody will build a tool 
>> in the near future like stats.grok.se so we can view statistics for 
>> individual files and/or sets of files à la BaGLAMa 2.
>>
>> -- Hay
>>
>> On Tue, Mar 24, 2015 at 6:39 PM, Erik Zachte <[email protected]>
>> wrote:
>> > Today WMF Analytics announces a new product: a daily feed of media 
>> > file request counts for all Wikimedia projects [1].
>> >
>> > The counts are based on unsampled data, so any single request 
>> > within the defined scope [2] will contribute to the counts.
>> >
>> > It can be seen as complementary to our page view counts files [5].
>> >
>> > The file layout is documented on wikitech [3].
>> >
>> > Daily counts have been backfilled from January 1, 2015 onwards.
>> >
>> >
>> >
>> > Additionally there is a daily zip file which contains a small 
>> > subset of these raw counts: top 1000 most requested media files, 
>> > one csv file for each column [7]. As these csv files have headers 
>> > (not so easy to add in Hive) you may want to start with this file 
>> > for a first impression (best opened in a spreadsheet program).
>> >
>> >
>> >
>> > The counts are collected from our Hadoop system, using a Hive 
>> > query, with data markup done in UDF scripts. This feed hopefully 
>> > addresses a long-standing request, expressed often and by many, 
>> > which we regrettably couldn't fulfil earlier, as our pre-Hadoop 
>> > infrastructure and processing capacity were not up to the task.
>> >
>> >
>> >
>> > An initial draft design (RFC) was presented last November at the 
>> > Amsterdam Hackathon 2014 (GLAM and Wikidata).
>> >
>> > Online consultation followed, leading to the current design [4].
>> >
>> >
>> >
>> > This is a data feed with production status, but not the final 
>> > release, as there is one major issue that hasn't been addressed yet 
>> > (but progress is being made):
>> >
>> > When using Media Viewer to view images, some images are prefetched 
>> > for better user experience, but these may never be shown to the user.
>> > Currently, those prefetched images are getting counted, as there is no 
>> > way to detect whether an image was actually shown to the user or not.
>> >
>> > Gilles Dubuc and other colleagues worked on a solution that would 
>> > not hamper performance (a tough challenge) and would help us 
>> > discern viewed from non-viewed files. A few days ago a patch was 
>> > published! Adaptation of the Hive query will follow later. [6] 
>> > Also, and related, context tagging isn't supported yet. [9]
>> >
>> >
>> >
>> > Huge thanks to all people who contributed to the process so far, 
>> > and still do.
>> >
>> > Special thanks to Christian Aistleitner with whom I co-authored the 
>> > design, and who also wrote the Hive implementation.
>> >
>> >
>> >
>> > Erik Zachte
>> >
>> >
>> >
>> > [1] http://dumps.wikimedia.org/other/mediacounts/
>> >
>> > [2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Filtering
>> >
>> > [3] https://wikitech.wikimedia.org/wiki/Analytics/Data/Mediacounts
>> >
>> > [4] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
>> >
>> > [5] https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
>> >
>> >       (a new version of this data feed is in the works)
>> >
>> > [6] https://phabricator.wikimedia.org/T89088
>> >
>> > [7] Before you ask: no plans yet for further aggregation into 
>> > monthly or yearly top ranking files. The current csv files are 
>> > quick wins, using standard Linux tools.
>> >
>> > [8] https://www.mediawiki.org/wiki/Multimedia/Media_Viewer
>> >
>> > [9] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#by_context
>> >
>> >
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > Analytics mailing list
>> > [email protected]
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
>> >