On Mon, Nov 2, 2015 at 10:27 AM, Nuria Ruiz <[email protected]> wrote:
> Team:
>
> Please take a look at Mediawiki API data needs, they made a nice wiki page
> for us to understand what type of data do they need.
>
> https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_request_analytics
>
> We already talked with them about using our user_agent data on wmf table so
> they can start on those reports right away so you might see some oozie CRs
> on that regard. Please have in mind that API folks need raw user agents (as
> every API client should have a unique one) rather than processed ones.

I've updated the wiki page with some refined ideas. It now has a short
section at the end with some really rough numbers that I took from the
existing wmf.webrequests data for 2015-11-01. Some things there that were
interesting, at least to me and the people I've shared these early findings
with:

* api.php gets hit 450M+ times a day by 300K+ distinct user-agents
* 65 user-agents each make >1M requests per day
* The top user-agent is no user agent at all (missing/empty header)
* Only 1% of Action API traffic comes from WMF servers (excluding labs)
* The existing UA-based classification of "spider" misses a lot of
  user-agents that are obviously bots
* We have a lot of API consumers that are violating the posted policy of
  using a unique UA for requests

I have a couple of small patches up for review [0][1] that introduce a new
UDF that can be used to classify an IP address as coming from an internal,
external, or labs host. I hope to have some oozie and hive scripts for
review by the end of next week.

[0]: https://gerrit.wikimedia.org/r/#/c/253045/
[1]: https://gerrit.wikimedia.org/r/#/c/253046/

Bryan
--
Bryan Davis              Wikimedia Foundation    <[email protected]>
[[m:User:BDavis_(WMF)]]  Sr Software Engineer    Boise, ID USA
irc: bd808               v:415.839.6885 x6855

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
