See also https://phabricator.wikimedia.org/T117945 and https://phabricator.wikimedia.org/T108867 for possibly related oddities in the top viewed pages. (And https://phabricator.wikimedia.org/T104755 : "Wikimedia's URL-routing logic straddles five layers ...")
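Until those tickets are resolved, the "-" and Special: rows in the quoted results below could be filtered out after the fact. Here is a minimal Python sketch, assuming the result rows are plain "URL views" lines as in the quoted query output. The helper names (parse_rows, title_of, is_listed, filter_rows) are made up for illustration, and the rule of dropping only "-" and Special: titles is my assumption, not an agreed definition of what counts as an article.

```python
from urllib.parse import urlsplit, unquote

# Sample rows copied from the quoted query output (first five rows only).
SAMPLE_ROWS = """\
https://en.wikipedia.org/wiki/Main_Page 43515253
https://en.wikipedia.org/wiki/Special:Search 4818687
https://en.wikipedia.org/wiki/- 2650346
https://en.wikipedia.org/wiki/Bajirao_I 1414810
https://en.wikipedia.org/wiki/Dilwale_(2015_film) 1410015
"""


def parse_rows(text):
    """Split each non-empty 'URL views' line into a (url, views) tuple."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The view count is the last whitespace-separated field.
        url, _, views = line.rpartition(" ")
        rows.append((url, int(views)))
    return rows


def title_of(url):
    """Return the page title, i.e. the URL path after '/wiki/'."""
    path = urlsplit(url).path
    prefix = "/wiki/"
    return unquote(path[len(prefix):]) if path.startswith(prefix) else path


def is_listed(url):
    """Assumed rule: drop the '-' placeholder title and Special: pages."""
    title = title_of(url)
    return title != "-" and not title.startswith("Special:")


def filter_rows(rows):
    """Keep only the rows whose title passes is_listed, preserving order."""
    return [(url, views) for url, views in rows if is_listed(url)]


if __name__ == "__main__":
    for url, views in filter_rows(parse_rows(SAMPLE_ROWS)):
        print(f"{views:>10}  {url}")
```

Note that Main_Page is kept under this rule; whether it belongs in a "top articles" list is a separate question.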
(switching CC to the intended Dan)

On Fri, Jan 22, 2016 at 3:17 PM, Ryan Kaldari <[email protected]> wrote:
> Any idea why the most popular article in India is "-"? CCing Dan Garry of
> Discovery team.
>
> On Fri, Jan 22, 2016 at 5:13 PM, Tilman Bayer <[email protected]> wrote:
>>
>> Below is an example Hive query yielding the 50 most viewed pages in
>> India during December 2015. It took less than 10 minutes of wall clock
>> time to complete.
>>
>> SELECT CONCAT('https://',project,'.org/wiki/',page_title),
>> SUM(view_count) AS views
>> FROM wmf.pageview_hourly
>> WHERE
>> year = 2015
>> AND month = 12
>> AND country = "India"
>> AND agent_type = "user"
>> GROUP BY project, page_title
>> ORDER BY views DESC LIMIT 50;
>>
>> ...
>> Total MapReduce CPU Time Spent: 0 days 19 hours 13 minutes 2 seconds 930 msec
>> OK
>> _c0 views
>> https://en.wikipedia.org/wiki/Main_Page 43515253
>> https://en.wikipedia.org/wiki/Special:Search 4818687
>> https://en.wikipedia.org/wiki/- 2650346
>> https://en.wikipedia.org/wiki/Bajirao_I 1414810
>> https://en.wikipedia.org/wiki/Dilwale_(2015_film) 1410015
>> https://en.wikipedia.org/wiki/Mastani 1232964
>> https://en.wikipedia.org/wiki/Bajirao_Mastani_(film) 1133261
>> https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2015 632890
>> https://en.wikipedia.org/wiki/Hate_Story_3 582816
>> https://en.wikipedia.org/wiki/Special:MobileMenu 499379
>> https://en.wikipedia.org/wiki/Star_Wars:_The_Force_Awakens 438113
>> https://en.wikipedia.org/wiki/Tamasha_(film) 390519
>> https://en.wikipedia.org/wiki/Prem_Ratan_Dhan_Payo 378133
>> https://en.wikipedia.org/wiki/India 368946
>> https://en.wikipedia.org/wiki/List_of_Bollywood_films_of_2016 335547
>> https://en.wikipedia.org/wiki/Star_Wars 334326
>> https://en.wikipedia.org/wiki/Sunny_Leone 333848
>> https://en.wikipedia.org/wiki/Sundar_Pichai 329264
>> https://en.wikipedia.org/wiki/Special:Book 324255
>> https://en.wikipedia.org/wiki/List_of_highest-grossing_Bollywood_films 321418
>> https://en.wikipedia.org/wiki/Salman_Khan 309113
>> https://en.wikipedia.org/wiki/'Tis_the_Season 308221
>> https://en.wikipedia.org/wiki/Mandana_Karimi 289662
>> https://en.wikipedia.org/wiki/Kyaa_Kool_Hain_Hum_3 281801
>> https://en.wikipedia.org/wiki/Kashibai 272673
>> https://en.wikipedia.org/wiki/Bigg_Boss_9 272203
>> https://en.wikipedia.org/wiki/Kriti_Sanon 266773
>> https://en.wikipedia.org/wiki/2012_Delhi_gang_rape 265296
>> https://en.wikipedia.org/wiki/Shah_Rukh_Khan 263729
>> https://en.wikipedia.org/wiki/Neerja_Bhanot 259410
>> https://en.wikipedia.org/wiki/Nora_Fatehi 252085
>> https://en.wikipedia.org/wiki/Ashoka 250255
>> https://en.wikipedia.org/wiki/B._K._S._Iyengar 248422
>> https://en.wikipedia.org/wiki/2015_South_Indian_floods 246377
>> https://en.wikipedia.org/wiki/Baahubali:_The_Beginning 244281
>> https://en.wikipedia.org/wiki/Shamsher_Bahadur_I_(Krishna_Rao) 232122
>> https://en.wikipedia.org/wiki/Christmas 228278
>> https://en.wikipedia.org/wiki/Thanga_Magan_(2015_film) 222373
>> https://en.wikipedia.org/wiki/Ranveer_Singh 221010
>> https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam 220612
>> https://en.wikipedia.org/wiki/Shivaji 218245
>> https://en.wikipedia.org/wiki/Deepika_Padukone 218242
>> https://en.wikipedia.org/wiki/TLC:_Tables,_Ladders_and_Chairs_(2015) 211920
>> https://en.wikipedia.org/wiki/Gizele_Thakral 206585
>> https://en.wikipedia.org/wiki/Urvashi_Rautela 204305
>> https://en.wikipedia.org/wiki/Peshwa 194957
>> https://en.wikipedia.org/wiki/Kajol 192044
>> https://hi.wikipedia.org/wiki/मुखपृष्ठ 184274
>> https://en.wikipedia.org/wiki/Quantico_(TV_series) 183112
>> https://en.wikipedia.org/wiki/Mahatma_Gandhi 182336
>> Time taken: 562.621 seconds, Fetched: 50 row(s)
>>
>>
>> See also the discussion at https://phabricator.wikimedia.org/T120113
>> (As mentioned there, a while ago I retrieved the global top 200 pages
>> for a timespan of almost six months, with some wait time but no major
>> issues.
>> It's not quite clear to me why the "brute force" approach
>> mentioned in the ticket failed, but I guess it had to do with the
>> difficulty of repeating such a query for all projects - or countries -
>> to generate top lists for every one of them.)
>>
>> On Wed, Jan 20, 2016 at 12:42 PM, Kevin Leduc <[email protected]> wrote:
>> > +Analytics list so they can comment.
>> >
>> > I don't have such a script. It's a pretty intensive job to compile top
>> > articles especially over a month. The pageview API was supposed to have top
>> > articles per month per wiki but the job is so massive that it failed to run
>> > in Hive. Analytics knows there are better algorithms out there to solve
>> > this problem. So the pageview API just has top per day per wiki.
>> >
>> > I imagine that you are looking at some very specific wikis and countries...
>> > not all of them. Maybe someone on the list can make an example hive script
>> > (given a wiki and country) that gives the top for a day.
>> >
>> >
>> > On Wed, Jan 20, 2016 at 12:23 PM, Dan Foy <[email protected]> wrote:
>> >>
>> >> Hi Kevin,
>> >>
>> >> In your collection of scripts for Hive, do you have one that can act as a
>> >> starting point for me to get the top N articles / URLs for Wikipedia in a
>> >> country?
>> >>
>> >> Thanks,
>> >> Dan
>> >
>> > _______________________________________________
>> > Analytics mailing list
>> > [email protected]
>> > https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>> --
>> Tilman Bayer
>> Senior Analyst
>> Wikimedia Foundation
>> IRC (Freenode): HaeB

--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
