John, Despite your query returning only 50 results, it is unlikely to finish- ever :-)
Your query is reading (almost) every single row of pagelinks (1 billion rows), -which is also by itself not very optimized-, and joining it with page (35 million rows), and then sorting all results, all without using a proper index, then discarding all of them except the top 50. I would wonder if you have considered using the cached functionality found at https://en.wikipedia.org/wiki/Special:WantedPages but if you absolutely have to run your query? Despite the wrong overriden mediawiki messages show there, those are updated twice a month: This is the same list as shown there on namespace 0: root@dbstore1001[enwiki]> SELECT qc_title as page, qc_value as numlinks FROM querycache WHERE qc_type = 'Wantedpages' and qc_namespace=0 ORDER BY qc_value DESC LIMIT 50; +--------------------------------------------+----------+ | page | numlinks | +--------------------------------------------+----------+ | HaRav_Moshe_Nehemia_Cohenov | 19729 | | HaRav_Meir_Shehenboim_(Etz_Hadar) | 19728 | | Wrestling_at_the_2010_South_American_Games | 12659 | | Post-structuralist_ballet | 12025 | | Robert_MacDonnell | 10541 | | Charles_Michael_Lavery_QC | 10540 | | Alison_Campbell | 9747 | | Shauna_Gunn | 9744 | | Sophie_Hoopman | 9742 | | District_Policing_Partnerships | 9741 | | Stanley_Leisure | 9740 | | Robert_Hall_(doctor) | 9739 | | Hunt_Committee | 9738 | | Brian_Baird_(newsreader) | 9738 | | Choo_Chiat_Goh | 7483 | | Michael_Cook_(ballet) | 7480 | | Zoulikha_Naciri | 7127 | | Mounir_Charïbi | 7127 | | Mohamed_Moatassim | 7127 | | Khalil_Hachimi_Idrissi | 7127 | | Hassan_Aourid | 7127 | | Faycal_Laraichi | 7127 | | Campocraspedon | 6982 | | Al-Masdar | 6811 | | Mackay_Radio | 6809 | | Lebanese_literature | 6042 | | Al_Nabad | 6007 | | Badran_Roy_Badran | 6006 | | Mohammad_Abdul_Jabbar_Khairy | 6001 | | Toufic_Awwad | 5999 | | Maurice_Chehab | 5999 | | Ghazi_Aad | 5999 | | Kawasaki_Vulcan_500_LTD | 5724 | | Buddhism_in_Cuba | 5706 | | Thomas_steam_bicycle | 5698 | | Cementerio_de_Santa_Ifigenia | 5696 | | Cuban_Liberation_Army | 5695 | | Fidel_Castro_Díaz-Balart | 5694 | | Consejo_popular_(Cuba) | 5694 | | Moto_Guzzi_V7 | 5687 | | Dirt_Rider | 5676 | | Dick_O'Brien | 5673 | | Motus_Motorcycles | 5671 | | Greenhorn_Enduro | 5671 | | Werner_Motocyclette | 5670 | | Vietnam_Veterans_Motorcycle_Club | 5670 | | Moto_Guzzi_Eldorado | 5670 | | Moto_Guzzi_Ambassador | 5670 | | Bimota_Tesi | 5670 | | Sara_Liberte | 5669 | +--------------------------------------------+----------+ 50 rows in set (0.00 sec) Last updated 11 days ago: root@dbstore1001[enwiki]> SELECT * FROM querycache_info WHERE qci_type='Wantedpages'; +-------------+----------------+ | qci_type | qci_timestamp | +-------------+----------------+ | Wantedpages | 20170912074914 | +-------------+----------------+ 1 row in set (0.04 sec) This table, querycache, could be replicated to the wiki replicas, but we need someone to help with the sanitization (remove the unwatched pages info). On Sat, Sep 23, 2017 at 2:40 AM, John <[email protected]> wrote: > The query I am using in this case is: > > select pl_title, count(*) from pagelinks left join page on pl_title = > page_title where pl_namespace = 0 and pl_from_namespace = 0 and page_id > IS NULL group by pl_title having COUNT(*) > 100 LIMIT 50; > > I am also running this on enwiki where the table is probably the largest. I > am attempting to locate red linked pages via number of incoming links. I > would expect it to take an hour or more to run probably, given the size of > the table. But I am getting the error message in the phab ticket: ERROR 2006 > (HY000): MySQL server has gone away > > On Fri, Sep 22, 2017 at 7:53 PM, Madhumitha Viswanathan > <[email protected]> wrote: >> >> Hi John, >> >> Please provide us with a bit more information so we can help - which host >> are you connecting to, what long running queries are you trying to run, etc. >> I believe you may be running into some timeout/mechanisms that we have set >> up to prevent abuse, see - https://phabricator.wikimedia.org/T76956#948591. >> >> As always, feel free to reach out to us on #wikimedia-cloud for any >> questions. >> >> Thanks, >> >> On Fri, Sep 22, 2017 at 1:29 PM, John <[email protected]> wrote: >>> >>> Do we have a way of letting SQL queries run for a day or so? Im using the >>> interactive sql <host> and running them that way but the DB host is going >>> away in the middle of the process. >>> >>> _______________________________________________ >>> Labs-l mailing list >>> [email protected] >>> https://lists.wikimedia.org/mailman/listinfo/labs-l >>> >> >> >> >> -- >> --Madhu :) >> >> _______________________________________________ >> Labs-l mailing list >> [email protected] >> https://lists.wikimedia.org/mailman/listinfo/labs-l >> > > > _______________________________________________ > Labs-l mailing list > [email protected] > https://lists.wikimedia.org/mailman/listinfo/labs-l > -- Jaime Crespo <http://wikimedia.org> _______________________________________________ Labs-l mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/labs-l
