On Sun, Sep 7, 2014 at 8:11 AM, Sean Pringle <[email protected]> wrote:

>
> On Sun, Sep 7, 2014 at 5:58 AM, Brad Jorsch (Anomie) <
> [email protected]> wrote:
>
>> The database query for that is simple enough:
>>
>> SELECT /* ApiQueryCategoryMembers::run Anomie */
>> cl_from,cl_sortkey,cl_type,page_namespace,page_title,cl_timestamp  FROM
>> `page`,`categorylinks`  FORCE INDEX (cl_timestamp) WHERE cl_to =
>> 'Copy_to_Wikimedia_Commons_(bot-assessed)' AND (cl_from=page_id)  ORDER BY
>> cl_timestamp,cl_from LIMIT 501;
>>
>> And the PHP code doesn't do anything complicated either. Maybe Sean can
>> give us more insight if there's some subtle database thing going on here.
>>
>
> As Nik noted, the query plan walking cl_timestamp is not ideal. Plus, even
> with the forced index the query requires a filesort since cl_timestamp
> index is on (cl_to,cl_timestamp) and not (cl_timestamp,cl_from).
>

We're including a constant cl_to in the query here, so the index on
(cl_to,cl_timestamp) is exactly what we want.

As for ORDER BY cl_timestamp, cl_from, that's
https://gerrit.wikimedia.org/r/#/c/103589/ taking advantage of InnoDB's
clustered indexes where it silently appends the primary key (or in this
case the first/only UNIQUE key, cl_from) to all other indexes.

When I EXPLAIN this query against enwiki, there's no filesort on master,
db1055, db1051, and db1066.

   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => categorylinks
       [type] => ref
       [possible_keys] => cl_timestamp
       [key] => cl_timestamp
       [key_len] => 257
       [ref] => const
       [rows] => 635858
       [Extra] => Using index condition; Using where
   )
   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => page
       [type] => eq_ref
       [possible_keys] => PRIMARY
       [key] => PRIMARY
       [key_len] => 4
       [ref] => enwiki.categorylinks.cl_from
       [rows] => 1
       [Extra] =>
   )

There is on db1061, db1062, db1065, db1072, and db1073.

   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => categorylinks
       [type] => ref
       [possible_keys] => cl_timestamp
       [key] => cl_timestamp
       [key_len] => 257
       [ref] => const
       [rows] => 706656
       [Extra] => Using index condition; Using where; Using filesort
   )
   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => page
       [type] => eq_ref
       [possible_keys] => PRIMARY
       [key] => PRIMARY
       [key_len] => 4
       [ref] => enwiki.categorylinks.cl_from
       [rows] => 1
       [Extra] =>
   )

My wild guess would be that the latter set of databases were somehow
created differently so that InnoDB is clustering using some index other
than cl_from.

>
> Removing the FORCE INDEX would allow cl_sortkey index to be used, with
> better selectivity.
>

On master, db1055, db1051, and db1066, removing the FORCE INDEX still
reports from EXPLAIN that it chose the cl_timestamp index. It does cause
EXPLAIN to stop saying "Using index condition" though.

   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => categorylinks
       [type] => ref
       [possible_keys] => cl_from,cl_timestamp,cl_sortkey
       [key] => cl_timestamp
       [key_len] => 257
       [ref] => const
       [rows] => 525652
       [Extra] => Using where
   )
   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => page
       [type] => eq_ref
       [possible_keys] => PRIMARY
       [key] => PRIMARY
       [key_len] => 4
       [ref] => enwiki.categorylinks.cl_from
       [rows] => 1
       [Extra] =>
   )

On db1061, db1062, db1065, db1072, and db1073, it does choose cl_sortkey
but it still filesorts.

   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => categorylinks
       [type] => ref
       [possible_keys] => cl_from,cl_timestamp,cl_sortkey
       [key] => cl_sortkey
       [key_len] => 257
       [ref] => const
       [rows] => 525652
       [Extra] => Using index condition; Using where; Using filesort
   )
   stdClass Object
   (
       [id] => 1
       [select_type] => SIMPLE
       [table] => page
       [type] => eq_ref
       [possible_keys] => PRIMARY
       [key] => PRIMARY
       [key_len] => 4
       [ref] => enwiki.categorylinks.cl_from
       [rows] => 1
       [Extra] =>
   )

If I also remove the cl_from from the ORDER BY, then all databases return
the same query using the cl_timestamp index. But the API can't do that
without reopening https://bugzilla.wikimedia.org/show_bug.cgi?id=24782.


-- 
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation
_______________________________________________
Mediawiki-api mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api

Reply via email to