Sure, I was assuming at least 50 of the links-from-ns-0 would be on the 5000 list (I haven't checked, though), making the filtering effort much faster.
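For illustration, that filtering could look something like the untested sketch below (it only uses columns that exist in the current schema): re-count ns 0 -> ns 0 links just for the ~5000 titles already cached for Special:WantedPages, so the full pagelinks table never has to be scanned.

-- Untested sketch: re-rank the cached Wantedpages entries by their
-- ns 0 -> ns 0 incoming links only, instead of scanning all of pagelinks.
SELECT qc.qc_title AS page,
       COUNT(*)    AS ns0_links
  FROM querycache qc
  JOIN pagelinks pl
    ON pl.pl_title = qc.qc_title
   AND pl.pl_namespace = qc.qc_namespace
 WHERE qc.qc_type = 'Wantedpages'
   AND qc.qc_namespace = 0
   AND pl.pl_from_namespace = 0   -- count only links coming from ns 0 pages
 GROUP BY qc.qc_title
 ORDER BY ns0_links DESC
 LIMIT 50;

The pl_from_namespace = 0 condition is what drops the talk-page/template links John describes below; a title such as HaRav_Moshe_Nehemia_Cohenov, with no ns 0 incoming links, would simply fall out of the result.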
Maybe you or someone else can come up with a MediaWiki patch for a new special page?

On Sat, Sep 23, 2017 at 7:30 PM, John <phoenixoverr...@gmail.com> wrote:
> However, there is one key difference in the query vs. WantedPages: if you
> take your first example, "HaRav_Moshe_Nehemia_Cohenov" is listed with almost
> 20k incoming links, but if you filter for only ns 0 links you will discover
> that it actually has zero incoming links. Not only was I attempting to get
> results only from ns 0, I was also looking to get only ns-0-based links.
> (Yes, I know some templates can still taint the results), but it will stop
> templates such as https://en.wikipedia.org/wiki/Template:WikiProject_Israel
> from making the report completely useless. Because of the size of most wiki
> projects, if they add a wanted-pages section to their talk page banner, the
> validity of WantedPages becomes significantly skewed.
>
> On Sat, Sep 23, 2017 at 1:16 PM, Jaime Crespo <jcre...@wikimedia.org> wrote:
>>
>> John,
>>
>> Despite your query returning only 50 results, it is unlikely to finish -
>> ever :-)
>>
>> Your query is reading (almost) every single row of pagelinks (1 billion
>> rows) - which is also by itself not very optimized - joining it with page
>> (35 million rows), and then sorting all the results, all without using a
>> proper index, only to discard all of them except the top 50.
>>
>> I would ask whether you have considered using the cached functionality at
>> https://en.wikipedia.org/wiki/Special:WantedPages, or whether you
>> absolutely have to run your own query. Despite what the incorrectly
>> overridden MediaWiki messages shown there say, those results are updated
>> twice a month. This is the same list as shown there, for namespace 0:
>>
>> root@dbstore1001[enwiki]> SELECT qc_title as page, qc_value as numlinks
>>     FROM querycache WHERE qc_type = 'Wantedpages' and qc_namespace=0
>>     ORDER BY qc_value DESC LIMIT 50;
>> +--------------------------------------------+----------+
>> | page                                       | numlinks |
>> +--------------------------------------------+----------+
>> | HaRav_Moshe_Nehemia_Cohenov                |    19729 |
>> | HaRav_Meir_Shehenboim_(Etz_Hadar)          |    19728 |
>> | Wrestling_at_the_2010_South_American_Games |    12659 |
>> | Post-structuralist_ballet                  |    12025 |
>> | Robert_MacDonnell                          |    10541 |
>> | Charles_Michael_Lavery_QC                  |    10540 |
>> | Alison_Campbell                            |     9747 |
>> | Shauna_Gunn                                |     9744 |
>> | Sophie_Hoopman                             |     9742 |
>> | District_Policing_Partnerships             |     9741 |
>> | Stanley_Leisure                            |     9740 |
>> | Robert_Hall_(doctor)                       |     9739 |
>> | Hunt_Committee                             |     9738 |
>> | Brian_Baird_(newsreader)                   |     9738 |
>> | Choo_Chiat_Goh                             |     7483 |
>> | Michael_Cook_(ballet)                      |     7480 |
>> | Zoulikha_Naciri                            |     7127 |
>> | Mounir_Charïbi                             |     7127 |
>> | Mohamed_Moatassim                          |     7127 |
>> | Khalil_Hachimi_Idrissi                     |     7127 |
>> | Hassan_Aourid                              |     7127 |
>> | Faycal_Laraichi                            |     7127 |
>> | Campocraspedon                             |     6982 |
>> | Al-Masdar                                  |     6811 |
>> | Mackay_Radio                               |     6809 |
>> | Lebanese_literature                        |     6042 |
>> | Al_Nabad                                   |     6007 |
>> | Badran_Roy_Badran                          |     6006 |
>> | Mohammad_Abdul_Jabbar_Khairy               |     6001 |
>> | Toufic_Awwad                               |     5999 |
>> | Maurice_Chehab                             |     5999 |
>> | Ghazi_Aad                                  |     5999 |
>> | Kawasaki_Vulcan_500_LTD                    |     5724 |
>> | Buddhism_in_Cuba                           |     5706 |
>> | Thomas_steam_bicycle                       |     5698 |
>> | Cementerio_de_Santa_Ifigenia               |     5696 |
>> | Cuban_Liberation_Army                      |     5695 |
>> | Fidel_Castro_Díaz-Balart                   |     5694 |
>> | Consejo_popular_(Cuba)                     |     5694 |
>> | Moto_Guzzi_V7                              |     5687 |
>> | Dirt_Rider                                 |     5676 |
>> | Dick_O'Brien                               |     5673 |
>> | Motus_Motorcycles                          |     5671 |
>> | Greenhorn_Enduro                           |     5671 |
>> | Werner_Motocyclette                        |     5670 |
>> | Vietnam_Veterans_Motorcycle_Club           |     5670 |
>> | Moto_Guzzi_Eldorado                        |     5670 |
>> | Moto_Guzzi_Ambassador                      |     5670 |
>> | Bimota_Tesi                                |     5670 |
>> | Sara_Liberte                               |     5669 |
>> +--------------------------------------------+----------+
>> 50 rows in set (0.00 sec)
>>
>> Last updated 11 days ago:
>>
>> root@dbstore1001[enwiki]> SELECT * FROM querycache_info
>>     WHERE qci_type='Wantedpages';
>> +-------------+----------------+
>> | qci_type    | qci_timestamp  |
>> +-------------+----------------+
>> | Wantedpages | 20170912074914 |
>> +-------------+----------------+
>> 1 row in set (0.04 sec)
>>
>> This table, querycache, could be replicated to the wiki replicas, but we
>> need someone to help with the sanitization (removing the unwatched pages
>> info).
>>
>> On Sat, Sep 23, 2017 at 2:40 AM, John <phoenixoverr...@gmail.com> wrote:
>> > The query I am using in this case is:
>> >
>> > select pl_title, count(*) from pagelinks left join page on pl_title =
>> > page_title where pl_namespace = 0 and pl_from_namespace = 0 and page_id
>> > IS NULL group by pl_title having COUNT(*) > 100 LIMIT 50;
>> >
>> > I am also running this on enwiki, where the table is probably the
>> > largest. I am attempting to locate red-linked pages by number of
>> > incoming links. I would expect it to take an hour or more to run, given
>> > the size of the table, but I am getting the error message in the phab
>> > ticket: ERROR 2006 (HY000): MySQL server has gone away
>> >
>> > On Fri, Sep 22, 2017 at 7:53 PM, Madhumitha Viswanathan
>> > <mviswanat...@wikimedia.org> wrote:
>> >>
>> >> Hi John,
>> >>
>> >> Please provide us with a bit more information so we can help - which
>> >> host are you connecting to, what long-running queries are you trying
>> >> to run, etc. I believe you may be running into some of the timeout
>> >> mechanisms that we have set up to prevent abuse; see
>> >> https://phabricator.wikimedia.org/T76956#948591.
>> >>
>> >> As always, feel free to reach out to us on #wikimedia-cloud for any
>> >> questions.
>> >>
>> >> Thanks,
>> >>
>> >> On Fri, Sep 22, 2017 at 1:29 PM, John <phoenixoverr...@gmail.com> wrote:
>> >>>
>> >>> Do we have a way of letting SQL queries run for a day or so? I'm using
>> >>> the interactive sql <host> and running them that way, but the DB host
>> >>> is going away in the middle of the process.
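PS: on the sanitization mentioned above, a minimal sketch could be a view on the replicas that exposes querycache with the rows for private reports filtered out. Both the enwiki_p view naming and the 'Unwatchedpages' type string are assumptions here and would need to be checked against the actual setup:

CREATE VIEW enwiki_p.querycache AS
  SELECT qc_type, qc_namespace, qc_title, qc_value
    FROM enwiki.querycache
   WHERE qc_type <> 'Unwatchedpages';  -- assumed type string of the private unwatched-pages report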
--
Jaime Crespo
<http://wikimedia.org>

_______________________________________________
Labs-l mailing list
Labs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/labs-l