I do not know how replication works, but I did and do assume it
replicates indexes. If so, the online manual starting at
https://www.mediawiki.org/wiki/Manual:Database_layout
may help you.

Purodha

On 10.03.2016 05:04, Huji Lee wrote:
@John: I don't know of a way, through SQL, to query the cached contents
of that special page. Is there a way, or did you mean I should copy
data from that special page into a temp table and join from there?

Also, I am afraid most of the top 500 do have counterparts in FA WP.
It is the lower ranks that we are interested in; I guess they fall in
the top 2000-3000.

@Jaime: The query which I claimed was slow actually never finished on
Quarry; it got killed. The solution you propose (divide and conquer)
might be my only option on Quarry. But I was hoping that wouldn't be
the case.

I guess I would seize this opportunity to ask two questions:

1) For people like me who don't have SELECT access to the original
tables (what we query is really VIEWs based on the true tables,
right?) and therefore cannot see what fields are indexed, is there a
place we can refer to, to see the details of the indexes available to
us for each table?
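
(Aside: on a plain MySQL/MariaDB connection, index metadata can
normally be inspected directly; whether the Labs views expose the
underlying indexes this way is an assumption on my part, so treat the
following as a sketch only.)

 -- Standard MySQL/MariaDB metadata statements; run against the
 -- replica views these may not show the real indexes of the base
 -- tables.
 SHOW INDEX FROM categorylinks;
 SHOW CREATE TABLE categorylinks;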

2) Even a simple query like below takes forever to run. Assuming that
ll_from is indexed (which I think is the case), what makes it so
expensive? Shouldn't a simple join on indexed fields + a simple
aggregation on an indexed field be relatively cheap?

select page_title, count(cl_to) as CNT from page join categorylinks on
page_title = cl_to where page_namespace = 14 group by cl_to;

If anyone reading this has the ability to run EXPLAIN on the above
query and share the results, that would be fantastic.
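
(Concretely, the statement I have in mind is just the query above with
the EXPLAIN keyword in front of it:)

 EXPLAIN
 select page_title, count(cl_to) as CNT
 from page
 join categorylinks on page_title = cl_to
 where page_namespace = 14
 group by cl_to;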

On Wed, Mar 9, 2016 at 3:17 AM, Jaime Crespo wrote:

I would recommend starting with the appropriate Special page:
[[Special:MostLinkedCategories]]
That generates the 500 categories with most members and is refreshed
every few days.

If that wasn't enough, I would do:

 SELECT max(cat_id) FROM category;

to get the last cat_id, and then, in small, manageable
transactions, run:

 SELECT page_title, cat_pages, cat_subcats, page_id
 FROM category JOIN page ON category.cat_title = page.page_title
   AND page.page_namespace = 14
 LEFT JOIN langlinks ON page_id = ll_from and ll_lang = 'fa'
 WHERE cat_id BETWEEN ? and ? and cat_pages > 500 and ll_from is null;

where the first 2 "?" are small ranges to cover, in small increments
(e.g. 10000 records each time), between 1 and max(cat_id). Change
"cat_pages > 500" to whatever fits your needs.

Double-check the WHERE/ON clauses; I'm not 100% sure they are right,
but it should be close to what you want.

On Wed, Mar 9, 2016 at 4:38 AM, John wrote:
> I don't have it handy, but category membership counts are stored in
> the db; use that instead of a count().
>
>
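
(The stored counts John presumably means are category.cat_pages and
category.cat_subcats, the same columns Jaime's query above uses; a
minimal sketch using Huji's threshold of 100 members:)

 SELECT cat_title, cat_pages, cat_subcats
 FROM category
 WHERE cat_pages > 100;
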
> On Tuesday, March 8, 2016, Huji Lee wrote:
>>
>> I am trying to find a list of those categories on EN WP that are
>> highly used (more than 100 subpages or subcats) but don't have a
>> counterpart in FA WP. The query I used is shown below and also on
>> https://quarry.wmflabs.org/query/7943 and is extremely slow.
>>
>> Any thoughts on how to make it more efficient?
>>
>> select page_title, ll_title, count(cl_to) as CNT
>> from page
>> left join langlinks
>> on ll_from = page_id
>> and ll_lang = 'fa'
>> join categorylinks
>> on page_title = cl_to
>> where page_namespace = 14
>> and ll_from is null
>> group by cl_to, ll_title
>> having CNT > 100
>> order by CNT desc;
>>
>>
>

--
Jaime Crespo


_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
