Hello
I am working on a potentially large database table, let's call it
"observation", that has a foreign key to table "measurement". Each measurement
is associated with either no observations or around five of them. In this kind
of situation, it is well known that the statistics on the foreign key column in
the observation table can get arbitrarily bad as the row count increases. In
particular, the estimate of the number of distinct values in the foreign key
column can be completely off.
To combat this issue I have set n_distinct=-0.2 on the foreign key column. With
this, the query planner gets good row count estimates, but the setting does not
appear to affect the cost estimate of an index scan. As a result, I get odd cost
estimates, as if fetching these approximately five rows with an index scan would
take hundreds of random disk reads. Due to this high cost estimate, when joining
these two tables, PostgreSQL switches from using multiple small index scans to
scanning the whole table a lot earlier than would be beneficial.
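(To give an idea of the join shape I mean; the measurement table is not part of
the test case below, so this is only an illustrative sketch:)
SELECT o.id, o.measurement_id
FROM measurement m
JOIN observation o ON o.measurement_id = m.id
WHERE m.id BETWEEN 1000000 AND 1100000;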
So, in more detail:
I am using PostgreSQL 9.2.1 on Windows 7 SP 1, installed with the EnterpriseDB
one-click installer, with default settings.
SELECT version();
PostgreSQL 9.2.1, compiled by Visual C++ build 1600, 64-bit
I have set up a simple testing database as follows (running these commands
takes around an hour on my DB. It's slow, sorry, but I need lots of rows to
show the issue):
CREATE TABLE observation
(
  id bigserial NOT NULL,
  measurement_id bigint NOT NULL,
  CONSTRAINT observation_pkey PRIMARY KEY (id)
);

CREATE INDEX observation_measurement_id_idx
  ON observation
  USING btree
  (measurement_id);

INSERT INTO observation
SELECT x AS id,
       CASE WHEN (x - 1) % 15 < 3 THEN ((x - 1) / 15) * 4
            WHEN (x - 1) % 15 < 8 THEN ((x - 1) / 15) * 4 + 1
            ELSE ((x - 1) / 15) * 4 + 2
       END AS measurement_id
FROM (SELECT generate_series(1, 100000000) AS x) AS series;

ANALYZE observation;
Here measurement_id stands for the foreign key to table "measurement". Having
that table is not actually required to show the issue, though. Each number in
the range 0...26666666 (1e8 * 4 / 15) appears 0, 3, 5 or 7 times in the
measurement_id column.
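The distribution can be verified with a query along these lines (it scans the
whole table, so it is slow); it should show roughly 6.7 million measurement_id
values occurring 3, 5 and 7 times each:
SELECT per_id.cnt, count(*) AS ids_with_this_count
FROM (SELECT measurement_id, count(*) AS cnt
      FROM observation
      GROUP BY measurement_id) AS per_id
GROUP BY per_id.cnt
ORDER BY per_id.cnt;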
After this, the statistics on the measurement_id column look something like this:
Distinct Values           1.23203e+006
Most Common Values        {6895590, 8496970, 23294094, 75266, 128877, 150786, 175001, 192645, 216918, 262742, ...
Most Common Frequencies   {0.0001, 0.0001, 0.0001, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, 6.66667e-005, ...
Histogram Bounds          {14, 208364, 511400, 840725, 1091642, 1392585, 1713998, 1945482, 2204897, 2476654, ...
Correlation               1
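(The same figures can be read directly from the standard pg_stats view:)
SELECT n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'observation'
  AND attname = 'measurement_id';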
As there are about 20e6 distinct values, the 1.2e6 estimate is already off by
more than a factor of ten, and it will only get worse as more rows are added.
Let's have a look at the query plan for fetching all observations of a single
measurement:
EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS on, TIMING on )
SELECT id, measurement_id
FROM observation
WHERE measurement_id = 200001;
Index Scan using observation_measurement_id_idx on public.observation  (cost=0.00..119.36 rows=82 width=16) (actual time=0.060..0.062 rows=5 loops=1)
  Output: id, measurement_id
  Index Cond: (observation.measurement_id = 200001)
  Buffers: shared read=5
Total runtime: 0.081 ms
The row estimate is off by more than a factor of ten, so let's help the planner
by telling it how many rows it is likely to find:
ALTER TABLE observation
ALTER COLUMN measurement_id
SET (n_distinct=-0.2);
ANALYZE observation;
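(The override itself can be verified from the system catalog:)
SELECT attname, attoptions
FROM pg_attribute
WHERE attrelid = 'observation'::regclass
  AND attname = 'measurement_id';
This should show attoptions containing n_distinct=-0.2.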
This doesn't change the column statistics noticeably, apart from the reported
number of distinct values.
EXPLAIN (ANALYZE on, VERBOSE on, COSTS on, BUFFERS on, TIMING on )
SELECT id, measurement_id
FROM observation
WHERE measurement_id = 200001;
Index Scan using observation_measurement_id_idx on public.observation  (cost=0.00..118.01 rows=5 width=16) (actual time=0.060..0.061 rows=5 loops=1)
  Output: id, measurement_id
  Index Cond: (observation.measurement_id = 200001)
  Buffers: shared read=5
Total runtime: 0.073 ms
So, the row count estimate is now good, but the cost estimate has not changed
much. If I halve random_page_cost (from 4 to 2), the estimated cost very nearly
halves as well, so it would appear this cost comes almost entirely from random
page accesses, approximately 29 of them (118.01 divided by random_page_cost 4 is
about 29.5). In the original plan this made sense, as it expected to fetch 82
rows, but for five rows it seems quite excessive.
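(The random_page_cost check can be done per session, along these lines:)
SET random_page_cost = 2;
EXPLAIN
SELECT id, measurement_id
FROM observation
WHERE measurement_id = 200001;
RESET random_page_cost;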
Enlarging the table from 100 million to 200 million rows almost doubles the
estimated cost (from 118.01 to 227.69). Since the estimated and actual row
counts stay the same and the lookup goes through a B-tree index, I'd expect the
cost to rise only slightly.
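(For reference, the table can be grown to 200 million rows by continuing the
same series, for example:)
INSERT INTO observation
SELECT x AS id,
       CASE WHEN (x - 1) % 15 < 3 THEN ((x - 1) / 15) * 4
            WHEN (x - 1) % 15 < 8 THEN ((x - 1) / 15) * 4 + 1
            ELSE ((x - 1) / 15) * 4 + 2
       END AS measurement_id
FROM (SELECT generate_series(100000001, 200000000) AS x) AS series;
ANALYZE observation;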
I have a distinct feeling that this is either a bug in the cost estimator or
that there is a perfectly valid reason I have missed. Either way, any insights
you might have into this issue would be appreciated.
--
Niko Kiirala