However, there is one key difference between the query and WantedPages. Take your first example: "HaRav_Moshe_Nehemia_Cohenov" is listed with almost 20k incoming links, but if you filter for only ns 0 links you will discover that it actually has zero incoming links. I was not only trying to get results in ns 0, but also trying to count only links originating in ns 0. (Yes, I know some templates can still taint the results.) Still, it will stop templates such as https://en.wikipedia.org/wiki/Template:WikiProject_Israel from making the report completely useless. Because of the size of most WikiProjects, if one adds a wanted-pages section to its talk page banner, the validity of WantedPages becomes significantly skewed.
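To illustrate the ns 0 point, here is a toy sqlite reproduction (hypothetical data; the real enwiki pagelinks table has on the order of a billion rows) showing how talk-page banner links inflate the unfiltered count while the `pl_from_namespace = 0` filter drops them to zero:

```python
import sqlite3

# Toy copy of the pagelinks schema (columns simplified). pl_from_namespace
# records the namespace of the page the link comes FROM, which is what
# lets talk-page banner links be filtered out.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pagelinks (
    pl_from INTEGER, pl_from_namespace INTEGER,
    pl_namespace INTEGER, pl_title TEXT)""")

# 500 links to the article title, but every one originates on a Talk:
# page (namespace 1), e.g. via a WikiProject banner template.
conn.executemany(
    "INSERT INTO pagelinks VALUES (?, ?, ?, ?)",
    [(100 + i, 1, 0, "HaRav_Moshe_Nehemia_Cohenov") for i in range(500)])

unfiltered = conn.execute(
    "SELECT COUNT(*) FROM pagelinks "
    "WHERE pl_namespace = 0 AND pl_title = 'HaRav_Moshe_Nehemia_Cohenov'"
).fetchone()[0]
filtered = conn.execute(
    "SELECT COUNT(*) FROM pagelinks "
    "WHERE pl_namespace = 0 AND pl_from_namespace = 0 "
    "AND pl_title = 'HaRav_Moshe_Nehemia_Cohenov'"
).fetchone()[0]
print(unfiltered, filtered)  # 500 0
```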
On Sat, Sep 23, 2017 at 1:16 PM, Jaime Crespo <jcre...@wikimedia.org> wrote:
> John,
>
> Despite your query returning only 50 results, it is unlikely to ever
> finish :-)
>
> Your query reads (almost) every single row of pagelinks (1 billion
> rows), which is by itself not very optimized, joins it with page
> (35 million rows), and then sorts all results, all without using a
> proper index, before discarding everything except the top 50.
>
> I wonder whether you have considered using the cached functionality
> found at https://en.wikipedia.org/wiki/Special:WantedPages, or if you
> absolutely have to run your own query. Despite the incorrectly
> overridden MediaWiki messages shown there, those results are updated
> twice a month. This is the same list as shown there for namespace 0:
>
> root@dbstore1001[enwiki]> SELECT qc_title AS page, qc_value AS numlinks
>     FROM querycache
>     WHERE qc_type = 'Wantedpages' AND qc_namespace = 0
>     ORDER BY qc_value DESC LIMIT 50;
> +--------------------------------------------+----------+
> | page                                       | numlinks |
> +--------------------------------------------+----------+
> | HaRav_Moshe_Nehemia_Cohenov                |    19729 |
> | HaRav_Meir_Shehenboim_(Etz_Hadar)          |    19728 |
> | Wrestling_at_the_2010_South_American_Games |    12659 |
> | Post-structuralist_ballet                  |    12025 |
> | Robert_MacDonnell                          |    10541 |
> | Charles_Michael_Lavery_QC                  |    10540 |
> | Alison_Campbell                            |     9747 |
> | Shauna_Gunn                                |     9744 |
> | Sophie_Hoopman                             |     9742 |
> | District_Policing_Partnerships             |     9741 |
> | Stanley_Leisure                            |     9740 |
> | Robert_Hall_(doctor)                       |     9739 |
> | Hunt_Committee                             |     9738 |
> | Brian_Baird_(newsreader)                   |     9738 |
> | Choo_Chiat_Goh                             |     7483 |
> | Michael_Cook_(ballet)                      |     7480 |
> | Zoulikha_Naciri                            |     7127 |
> | Mounir_Charïbi                             |     7127 |
> | Mohamed_Moatassim                          |     7127 |
> | Khalil_Hachimi_Idrissi                     |     7127 |
> | Hassan_Aourid                              |     7127 |
> | Faycal_Laraichi                            |     7127 |
> | Campocraspedon                             |     6982 |
> | Al-Masdar                                  |     6811 |
> | Mackay_Radio                               |     6809 |
> | Lebanese_literature                        |     6042 |
> | Al_Nabad                                   |     6007 |
> | Badran_Roy_Badran                          |     6006 |
> | Mohammad_Abdul_Jabbar_Khairy               |     6001 |
> | Toufic_Awwad                               |     5999 |
> | Maurice_Chehab                             |     5999 |
> | Ghazi_Aad                                  |     5999 |
> | Kawasaki_Vulcan_500_LTD                    |     5724 |
> | Buddhism_in_Cuba                           |     5706 |
> | Thomas_steam_bicycle                       |     5698 |
> | Cementerio_de_Santa_Ifigenia               |     5696 |
> | Cuban_Liberation_Army                      |     5695 |
> | Fidel_Castro_Díaz-Balart                   |     5694 |
> | Consejo_popular_(Cuba)                     |     5694 |
> | Moto_Guzzi_V7                              |     5687 |
> | Dirt_Rider                                 |     5676 |
> | Dick_O'Brien                               |     5673 |
> | Motus_Motorcycles                          |     5671 |
> | Greenhorn_Enduro                           |     5671 |
> | Werner_Motocyclette                        |     5670 |
> | Vietnam_Veterans_Motorcycle_Club           |     5670 |
> | Moto_Guzzi_Eldorado                        |     5670 |
> | Moto_Guzzi_Ambassador                      |     5670 |
> | Bimota_Tesi                                |     5670 |
> | Sara_Liberte                               |     5669 |
> +--------------------------------------------+----------+
> 50 rows in set (0.00 sec)
>
> Last updated 11 days ago:
>
> root@dbstore1001[enwiki]> SELECT * FROM querycache_info
>     WHERE qci_type = 'Wantedpages';
> +-------------+----------------+
> | qci_type    | qci_timestamp  |
> +-------------+----------------+
> | Wantedpages | 20170912074914 |
> +-------------+----------------+
> 1 row in set (0.04 sec)
>
> This table, querycache, could be replicated to the wiki replicas, but
> we need someone to help with the sanitization (removing the unwatched
> pages info).
>
> On Sat, Sep 23, 2017 at 2:40 AM, John <phoenixoverr...@gmail.com> wrote:
> > The query I am using in this case is:
> >
> >     SELECT pl_title, COUNT(*)
> >     FROM pagelinks
> >     LEFT JOIN page ON pl_title = page_title
> >     WHERE pl_namespace = 0 AND pl_from_namespace = 0
> >       AND page_id IS NULL
> >     GROUP BY pl_title
> >     HAVING COUNT(*) > 100
> >     LIMIT 50;
> >
> > I am also running this on enwiki, where the table is probably the
> > largest. I am attempting to locate red-linked pages by number of
> > incoming links. I would expect it to take an hour or more to run,
> > given the size of the table.
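Jaime's objection (full scan, join, and sort with no usable index) can be eased by aggregating pagelinks first and only then anti-joining against page. The rewrite below is a sketch of that shape, not the production plan, run on a toy sqlite copy of the schema with made-up rows; note it also matches page on both namespace and title, since joining on title alone would hide a red link whose title exists as a page in another namespace:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pagelinks (pl_from INT, pl_from_namespace INT,
                        pl_namespace INT, pl_title TEXT);
CREATE TABLE page (page_id INT, page_namespace INT, page_title TEXT);
-- the index order the grouped scan would rely on
CREATE INDEX pl_ns_title ON pagelinks (pl_namespace, pl_title);
""")
conn.executemany("INSERT INTO pagelinks VALUES (?, ?, ?, ?)", [
    (1, 0, 0, "Existing_page"),
    (2, 0, 0, "Red_link"),
    (3, 0, 0, "Red_link"),
])
conn.execute("INSERT INTO page VALUES (1, 0, 'Existing_page')")

# Aggregate inside a derived table first, then anti-join against page.
# (The HAVING n > 100 threshold from the original query is dropped here
# only because the toy data is tiny.)
wanted = conn.execute("""
    SELECT pl.pl_title, pl.n
    FROM (SELECT pl_title, COUNT(*) AS n
          FROM pagelinks
          WHERE pl_namespace = 0 AND pl_from_namespace = 0
          GROUP BY pl_title) pl
    LEFT JOIN page ON page.page_namespace = 0
                  AND page.page_title = pl.pl_title
    WHERE page.page_id IS NULL
    ORDER BY pl.n DESC
    LIMIT 50
""").fetchall()
print(wanted)  # [('Red_link', 2)]
```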
> > But I am getting the error message in the phab ticket:
> >
> >     ERROR 2006 (HY000): MySQL server has gone away
> >
> > On Fri, Sep 22, 2017 at 7:53 PM, Madhumitha Viswanathan
> > <mviswanat...@wikimedia.org> wrote:
> >>
> >> Hi John,
> >>
> >> Please provide us with a bit more information so we can help: which
> >> host are you connecting to, what long-running queries are you trying
> >> to run, etc. I believe you may be running into some timeout
> >> mechanisms that we have set up to prevent abuse; see
> >> https://phabricator.wikimedia.org/T76956#948591.
> >>
> >> As always, feel free to reach out to us on #wikimedia-cloud for any
> >> questions.
> >>
> >> Thanks,
> >>
> >> On Fri, Sep 22, 2017 at 1:29 PM, John <phoenixoverr...@gmail.com> wrote:
> >>>
> >>> Do we have a way of letting SQL queries run for a day or so? I'm
> >>> using the interactive `sql <host>` and running them that way, but
> >>> the DB host is going away in the middle of the process.
> >>
> >> --
> >> --Madhu :)
>
> --
> Jaime Crespo
> <http://wikimedia.org>
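Given the per-query limits Madhumitha mentions, one possible workaround (an assumption on my part, not an official recommendation) is to split the pagelinks scan into many short statements keyed on a title prefix, so no single query runs long enough to be killed. This sketch only builds the statements; it covers uppercase ASCII and digits, whereas a real run would need every possible leading byte:

```python
import string

# Hypothetical batching: one short GROUP BY per leading character
# instead of a single billion-row scan.
prefixes = list(string.ascii_uppercase) + list(string.digits)
queries = [
    "SELECT pl_title, COUNT(*) AS n FROM pagelinks "
    "WHERE pl_namespace = 0 AND pl_from_namespace = 0 "
    f"AND pl_title LIKE '{p}%' "
    "GROUP BY pl_title HAVING n > 100"
    for p in prefixes
]
print(len(queries))  # 36
```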
_______________________________________________
Labs-l mailing list
Labs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/labs-l