Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

> On 14.08.2017 19:33, Konstantin Knizhnik wrote:
> On 14.08.2017 12:37, Konstantin Knizhnik wrote:
>
> Hi hackers,
>
> I am trying to compare different ways of optimizing work with huge
> append-only tables in PostgreSQL, where the primary key is something like a
> timestamp and queries usually access the most recent data through some
> secondary key. The size of the secondary index is one of the most critical
> factors limiting insert/search performance. Since data is inserted in
> ascending timestamp order, access to the primary key is well localized and
> the accessed pages stay in memory. But if we create a secondary index on
> the whole table, access to it requires random reads from the disk and
> significantly decreases performance.
>
> There are two well-known solutions to this problem:
> 1. Table partitioning
> 2. Partial indexes
>
> These are the approaches I want to compare. First of all I want to check
> whether the optimizer is able to generate an efficient query execution
> plan covering different time intervals.
> Unfortunately, in both cases the generated plan is not optimal.
>
> 1. Table partitioning:
>
> create table base (k integer primary key, v integer);
> create table part1 (check (k between 1 and 10000)) inherits (base);
> create table part2 (check (k between 10001 and 20000)) inherits (base);
> create index pi1 on part1(v);
> create index pi2 on part2(v);
> insert into part1 values (generate_series(1,10000), random());
> insert into part2 values (generate_series(10001,20000), random());
> explain select * from base where k between 1 and 20000 and v = 100;
>                                QUERY PLAN
> -----------------------------------------------------------------------
>  Append  (cost=0.00..15.65 rows=3 width=8)
>    ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
>          Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
>    ->  Index Scan using pi1 on part1  (cost=0.29..8.31 rows=1 width=8)
>          Index Cond: (v = 100)
>          Filter: ((k >= 1) AND (k <= 20000))
>    ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
>          Index Cond: (v = 100)
>          Filter: ((k >= 1) AND (k <= 20000))
>
> Questions:
> - Is there some way to avoid the sequential scan of the parent table? Yes,
>   it is empty, so the sequential scan will not take much time, but it
>   still requires some additional work and so increases query execution
>   time.
> - Why does the index scan of a partition's index include the filter
>   condition, when the check constraint already guarantees that all records
>   of this partition match the search predicate?
>
> 2. Partial indexes:
>
> create table t (k integer primary key, v integer);
> insert into t values (generate_series(1,20000), random());
> create index i1 on t(v) where k between 1 and 10000;
> create index i2 on t(v) where k between 10001 and 20000;
> postgres=# explain select * from t where k between 1 and 10000 and v = 100;
>                          QUERY PLAN
> ------------------------------------------------------------
>  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
>    Index Cond: (v = 100)
> (2 rows)
>
> Here we get the perfect plan. Let's try to extend the search interval:
>
> postgres=# explain select * from t where k between 1 and 20000 and v = 100;
>                             QUERY PLAN
> ------------------------------------------------------------------
>  Index Scan using t_pkey on t  (cost=0.29..760.43 rows=1 width=8)
>    Index Cond: ((k >= 1) AND (k <= 20000))
>    Filter: (v = 100)
> (3 rows)
>
> Unfortunately, in this case Postgres is not able to apply the partial
> indexes.
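>
> (A minimal sketch of the rule at work; the example ranges below are
> arbitrary, chosen only for illustration. The planner uses a partial index
> only when the WHERE clause provably implies the index predicate, so a
> range lying entirely inside one predicate is fine, while a range spanning
> both predicates is rejected even though together they cover it.)
>
> -- implied by i1's predicate (k between 1 and 10000): i1 is usable
> explain select * from t where k between 5000 and 9000 and v = 100;
> -- implied by neither predicate alone: the planner ignores i1 and i2
> -- and falls back to the primary key index t_pkey
> explain select * from t where k between 1 and 20000 and v = 100;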
> And this is what I expected to get:
>
> postgres=# explain select * from t where k between 1 and 10000 and v = 100
> union all select * from t where k between 10001 and 20000 and v = 100;
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  Append  (cost=0.29..14.58 rows=2 width=8)
>    ->  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using i2 on t t_1  (cost=0.29..7.28 rows=1 width=8)
>          Index Cond: (v = 100)
>
> I wonder whether there are fundamental problems with supporting these two
> things in the optimizer:
> 1. Remove a search condition on the primary key if it is fully implied by
>    the derived table's check constraint.
> 2. Append index scans of several partial indexes if the specified interval
>    is covered by their predicates.
>
> I wonder if someone familiar with this part of the optimizer can easily
> fix it.
> Otherwise I am going to spend some time on solving these problems (if the
> community thinks that such optimizations would be useful).
>
> Replying to myself: the following small patch removes redundant checks
> from index scans over derived tables:
>
> diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
> index 939045d..1f7c9cf 100644
> --- a/src/backend/optimizer/util/plancat.c
> +++ b/src/backend/optimizer/util/plancat.c
> @@ -1441,6 +1441,23 @@ relation_excluded_by_constraints(PlannerInfo *root,
>  	if (predicate_refuted_by(safe_constraints, rel->baserestrictinfo, false))
>  		return true;
>  
> +	/*
> +	 * Remove from the restriction list items implied by table constraints.
> +	 */
> +	{
> +		List	   *safe_restrictions = NIL;
> +
> +		foreach(lc, rel->baserestrictinfo)
> +		{
> +			RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
> +
> +			if (!predicate_implied_by(list_make1(rinfo->clause),
> +									  safe_constraints, false))
> +				safe_restrictions = lappend(safe_restrictions, rinfo);
> +		}
> +		rel->baserestrictinfo = safe_restrictions;
> +	}
> +
>  	return false;
>  }
>
> =================================================
>
> I am not sure whether this is the right place to adjust the restriction
> list, or whether such a transformation is always correct.
> But for the queries I have tested, the produced plans are correct:
>
> postgres=# explain select * from base where k between 1 and 20000 and v = 100;
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Append  (cost=0.00..15.64 rows=3 width=8)
>    ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
>          Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
>    ->  Index Scan using pi1 on part1  (cost=0.29..8.30 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
>          Index Cond: (v = 100)
> (7 rows)
>
> postgres=# explain select * from base where k between 1 and 15000 and v = 100;
>                               QUERY PLAN
> -----------------------------------------------------------------------
>  Append  (cost=0.00..15.64 rows=3 width=8)
>    ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
>          Filter: ((k >= 1) AND (k <= 15000) AND (v = 100))
>    ->  Index Scan using pi1 on part1  (cost=0.29..8.30 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
>          Index Cond: (v = 100)
>          Filter: (k <= 15000)
> (8 rows)
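>
> A simple sanity check for the patch (hypothetical queries, added here only
> to illustrate the invariant): dropping an implied clause must never change
> the result, only the plan, because the partition's CHECK constraint
> already enforces it:
>
> -- part2's CHECK constraint guarantees 10001 <= k <= 20000, so both
> -- queries must return exactly the same count
> select count(*) from part2 where v = 100;
> select count(*) from part2 where k <= 20000 and v = 100;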
> There is one more problem with the predicate_implied_by function and
> standard Postgres partitions.
> Right now Postgres specifies constraints for partitions using intervals
> with an open high boundary:
>
> postgres=# create table bt (k integer, v integer) partition by range (k);
> postgres=# create table dt1 partition of bt for values from (1) to (10001);
> postgres=# create table dt2 partition of bt for values from (10001) to (20001);
> postgres=# create index dti1 on dt1(v);
> postgres=# create index dti2 on dt2(v);
> postgres=# insert into bt values (generate_series(1,20000), random());
>
> postgres=# \d+ dt2
>                                    Table "public.dt2"
>  Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
> --------+---------+-----------+----------+---------+---------+--------------+-------------
>  k      | integer |           |          |         | plain   |              |
>  v      | integer |           |          |         | plain   |              |
> Partition of: bt FOR VALUES FROM (10001) TO (20001)
> Partition constraint: ((k IS NOT NULL) AND (k >= 10001) AND (k < 20001))
> Indexes:
>     "dti2" btree (v)
>
> If I now run the query with the predicate "k between 1 and 20000", I still
> see an extra check in the plan:
>
> postgres=# explain select * from bt where k between 1 and 20000 and v = 100;
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  Append  (cost=0.29..15.63 rows=2 width=8)
>    ->  Index Scan using dti1 on dt1  (cost=0.29..8.30 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using dti2 on dt2  (cost=0.29..7.33 rows=1 width=8)
>          Index Cond: (v = 100)
>          Filter: (k <= 20000)
> (6 rows)
>
> This is because operator_predicate_proof is not able to understand that
> "k < 20001" and "k <= 20000" are equivalent for the integer type.
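>
> (To spell out why that equivalence holds; the queries below are mine,
> added only for illustration. Integer is a discrete type, so there is no
> value strictly between 20000 and 20001, and the two predicates select
> exactly the same rows. For a continuous type such as numeric the same
> reasoning would be wrong: 20000.5 satisfies "k < 20001" but not
> "k <= 20000".)
>
> -- both predicates select the same 20000 rows of bt
> select count(*) from bt where k < 20001;
> select count(*) from bt where k <= 20000;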
> If I rewrite the query as (k >= 1 and k < 20001), the plan is optimal:
>
> postgres=# explain select * from bt where k >= 1 and k < 20001 and v = 100;
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  Append  (cost=0.29..15.63 rows=2 width=8)
>    ->  Index Scan using dti1 on dt1  (cost=0.29..8.30 rows=1 width=8)
>          Index Cond: (v = 100)
>    ->  Index Scan using dti2 on dt2  (cost=0.29..7.33 rows=1 width=8)
>          Index Cond: (v = 100)
> (5 rows)
>
> It is fixed by one more patch, to predtest.c:

Have you considered using the range types (internally in
operator_predicate_proof()) instead of hard-coding operator OIDs? The range
types do have the knowledge that k < 20001 and k <= 20000 are equivalent for
the integer type:

postgres=# SELECT int4range '(, 20001)' = int4range '(, 20000]';
 ?column?
----------
 t
(1 row)

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
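A follow-up sketch of the same idea (the query below is mine, not from the
original mail): implication, not only equivalence, also maps onto range
containment once the bounds are canonicalized, which int4range does
automatically by normalizing every range to "[)" form:

-- "k >= 1 and k <= 20000" implies "k < 20001": containment proves it,
-- because '[1, 20000]' canonicalizes to '[1,20001)'
SELECT int4range '(, 20001)' @> int4range '[1, 20000]';   -- returns t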