Re: [PERFORM]

Brad DeJong Wed, 28 Jun 2017 06:11:31 -0700


On 2017-06-28, Pavel Stehule wrote ...
> On 2017-06-28, Yevhenii Kurtov wrote ...
>> On 2017-06-28, Pavel Stehule wrote ...
>>> On 2017-06-28, Yevhenii Kurtov wrote ...
>>>> We have a query that is run almost each second and it's very important to 
>>>> squeeze every other ms out of it. The query is:
>>>> ...
>>>> I added following index: CREATE INDEX ON campaign_jobs(id, status, 
>>>> failed_at, started_at, priority DESC, times_failed);
>>>> ...
>>> There are few issues 
>>> a) parametrized LIMIT
>>> b) complex predicate with lot of OR 
>>> c) slow external sort
>>>
>>> b) signalize maybe some strange in design .. try to replace "OR" by "UNION" 
>>> query
>>> c) if you can and you have good enough memory .. try to increase work_mem 
>>> .. maybe 20MB
>>>
>>> if you change query to union queries, then you can use conditional indexes
>>>
>>> create index(id) where status = 0;
>>> create index(failed_at) where status = 2;
>>> create index(started_at) where status = 1;
>>
>> Can you please give a tip how to rewrite the query with UNION clause?
>
> SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE (((c0."status" = $1) AND NOT (c0."id" = ANY($2))))
> UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE ((c0."status" = $3) AND (c0."failed_at" > $4))
> UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE ((c0."status" = $5) AND (c0."started_at" < $6))
> ORDER BY c0."priority" DESC, c0."times_failed"
> LIMIT $7
> FOR UPDATE SKIP LOCKED



Normally (at least for developers I've worked with), that kind of query 
structure is used when the "status" values don't overlap and don't change from 
query to query. Judging from Pavel's suggested conditional indexes (i.e. "where 
status = <constant>"), he also thinks that is likely.

Give the optimizer that information so that it can use it. Assuming $1 = 0, 
$3 = 2 and $5 = 1, substitute literals for the status parameters, and 
substitute a literal for $7 in the LIMIT. Replace UNION with UNION ALL (I'm not 
sure about Postgres, but this lets other DBMSs avoid sorting and merging the 
result sets to eliminate duplicates). Using UNION ALL assumes that "id" is 
unique across rows, as implied by only "id" being selected with FOR UPDATE. If 
multiple rows can have the same "id", then use UNION to eliminate the 
duplicates.

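-- Postgres does not allow FOR UPDATE on a UNION result, and ORDER BY on a UNION
-- can only use selected columns, so the UNION goes into a subquery.
SELECT "id" FROM "campaign_jobs" WHERE "id" IN (
  SELECT "id" FROM "campaign_jobs" WHERE "status" = 0 AND NOT "id" = ANY($1)
    UNION ALL
  SELECT "id" FROM "campaign_jobs" WHERE "status" = 2 AND "failed_at" > $2
    UNION ALL
  SELECT "id" FROM "campaign_jobs" WHERE "status" = 1 AND "started_at" < $3
)
ORDER BY "priority" DESC, "times_failed"
LIMIT 100
FOR UPDATE SKIP LOCKED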


Another thing that you could try is to push the ORDER BY and LIMIT into the 
branches of the UNION (or does Postgres figure that out automatically?) and use 
slightly different indexes. This may not make sense for all the branches, but 
one nice thing about UNION is that each branch can be tweaked independently. 
Also, there are probably unmentioned functional dependencies that you can use 
to reduce the index size and/or improve your match rate. For example, if 
status = 1 means that the campaign_job has started but not failed or completed, 
then you may know that started_at is set, but failed_at and ended_at are null. 
The < comparison in and of itself implies that only rows where "started_at" is 
not null will match the condition.

-- Postgres needs parentheses around a UNION branch that has its own ORDER BY
-- and LIMIT, and does not allow FOR UPDATE on a UNION, so the UNION goes into
-- a subquery and the outer query locks the rows.
SELECT "id" FROM "campaign_jobs" WHERE "id" IN (
  (SELECT c0."id" FROM "campaign_jobs" AS c0
   WHERE c0."status" = 0 AND NOT c0."id" = ANY($1)
   ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100)
  UNION ALL
  (SELECT c0."id" FROM "campaign_jobs" AS c0
   WHERE c0."status" = 2 AND c0."failed_at" > $2
   ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100)
  UNION ALL
  (SELECT c0."id" FROM "campaign_jobs" AS c0
   WHERE c0."status" = 1 AND c0."started_at" < $3
   ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100)
)
ORDER BY "priority" DESC, "times_failed"
LIMIT 100
FOR UPDATE SKIP LOCKED

Including the "priority", "times_failed" and "id" columns in the indexes along 
with "failed_at"/"started_at" allows the optimizer to do index-only scans. (It 
may still have to do random I/O to the data page to determine tuple version 
visibility, but I don't think that can be eliminated.)

create index ... ("priority" desc, "times_failed", "id")
  where "status" = 0;
create index ... ("priority" desc, "times_failed", "id", "failed_at")
  where "status" = 2 and "failed_at" is not null;
create index ... ("priority" desc, "times_failed", "id", "started_at")
  where "status" = 1 and "started_at" is not null; -- and ended_at is null and ...


I'm assuming that the optimizer knows that "where status = 1 and started_at < 
$3" implies "and started_at is not null" and will consider the conditional 
index. If not, then the "and started_at is not null" needs to be explicit.
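
One way to check this rather than assume it is to look at the plan for a single 
branch. A minimal sketch, assuming the conditional indexes above have been 
created; the one-hour cutoff is a made-up stand-in for $3:

```sql
-- The interval is a hypothetical stand-in for the $3 parameter.
-- Note that EXPLAIN ANALYZE actually executes the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "id" FROM "campaign_jobs"
WHERE "status" = 1 AND "started_at" < now() - interval '1 hour'
ORDER BY "priority" DESC, "times_failed"
LIMIT 100;
```

If the plan shows an Index Scan or Index Only Scan on the "status" = 1 index, 
the implication was recognized. If it does not, a missing "started_at" is not 
null in the query may be the reason, although the planner may also simply have 
judged another path cheaper. For an Index Only Scan, the "Heap Fetches" line 
shows how many rows still needed a visit to the heap for the visibility check.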

-- 
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
