Hello, I did a more comprehensive test with different numbers of partitions, and I found this:
Summary of buffer usage for the first call vs. the second call on the same session, querying tables with 200, 100, 50, and 10 partitions:

First call (new session):
  200 partitions: 12,828 buffers (100 MB)
  100 partitions:  9,329 buffers (72 MB)
   50 partitions:  3,305 buffers (25 MB)
   10 partitions:    875 buffers (7 MB)

Same query on the same session:
  200 partitions: 205 buffers (1.6 MB)
  100 partitions:   5 buffers (40 KB)
   50 partitions:   5 buffers (40 KB)
   10 partitions:   5 buffers (40 KB)

I also tested on PG 17.3: no relevant changes. Question is, does it make sense?

*These are the steps to reproduce it:*

docker pull postgres:17.2
docker run -itd -e POSTGRES_USER=bruno -e POSTGRES_PASSWORD=bruno -p 5500:5432 -v /home/bruno/pgdata17:/var/lib/postgresql/data --name postgresql postgres:17.2

export PGHOST="localhost"
export PGPORT=5500
export PGDATABASE="postgres"
export PGUSER="bruno"
export PGPASSWORD="bruno"

CREATE EXTENSION IF NOT EXISTS "pgcrypto"; -- enables the gen_random_uuid function

CREATE TABLE dicom_series (
    series_uid UUID DEFAULT gen_random_uuid(),
    series_description VARCHAR(255),
    modality VARCHAR(16),
    body_part_examined VARCHAR(64),
    patient_id VARCHAR(64),
    study_uid UUID DEFAULT gen_random_uuid(),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create the parent table
CREATE TABLE dicom_sops_100_part (
    sop_uid UUID NOT NULL,
    series_uid UUID NOT NULL,
    instance_number INT,
    image_position_patient TEXT,
    image_orientation_patient TEXT,
    slice_thickness DECIMAL(10, 2),
    slice_location DECIMAL(10, 2),
    pixel_spacing TEXT,
    rows INT,
    columns INT,
    acquisition_date DATE,
    acquisition_time TIME,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) PARTITION BY HASH (sop_uid);

-- Create 100 partitions
DO $$
DECLARE
    partition_number INT;
BEGIN
    FOR partition_number IN 0..99 LOOP
        EXECUTE format(
            'CREATE TABLE dicom_sops_100_p%1$s PARTITION OF dicom_sops_100_part FOR VALUES WITH (MODULUS 100, REMAINDER %1$s);',
            partition_number
        );
    END LOOP;
END $$;

*Data population:*

DO $$
DECLARE
    series_count INT := 1000000;  -- number of series to create
    sops_per_series INT := 20;
    i INT;
    j INT;
    series_id UUID;
    sop_id UUID;
BEGIN
    FOR i IN 1..series_count LOOP
        -- Insert into dicom_series with a generated UUID
        INSERT INTO dicom_series (
            series_description, modality, body_part_examined, patient_id
        ) VALUES (
            'Series Description ' || i, 'CT', 'Chest', 'PATIENT-' || i
        ) RETURNING series_uid INTO series_id;

        FOR j IN 1..sops_per_series LOOP
            -- Insert into dicom_sops_100_part with a generated UUID
            INSERT INTO dicom_sops_100_part (
                sop_uid, series_uid, instance_number, image_position_patient,
                image_orientation_patient, slice_thickness, slice_location,
                pixel_spacing, rows, columns, acquisition_date, acquisition_time
            ) VALUES (
                gen_random_uuid(), series_id, j,
                '(0.0, 0.0, ' || j || ')',
                '(1.0, 0.0, 0.0, 0.0, 1.0, 0.0)',
                1.0, j * 5.0, '1.0\\1.0',
                512, 512, CURRENT_DATE, CURRENT_TIME
            );
        END LOOP;
    END LOOP;
END $$;

*Add indexes and vacuum analyze:*

CREATE UNIQUE INDEX idx_series_uid ON dicom_series(series_uid);
CREATE INDEX dicom_sops_100_part_sop_uid_idx ON dicom_sops_100_part(sop_uid);
CREATE INDEX dicom_sops_100_part_series_uid_idx ON dicom_sops_100_part(series_uid);
vacuum freeze;
analyze;

*Testing:*

Disconnect and reconnect to the db with psql.
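One way to get some visibility on this per-session caching while testing (just a sketch; it assumes PG 14+ where pg_backend_memory_contexts is available, and it only shows the backend's cache memory growing, not which catalogs were read):

-- run in the fresh session, then again after the first explain;
-- CacheMemoryContext is where the relation/catalog caches live and
-- it is populated lazily on first use
SELECT name, used_bytes, total_bytes
FROM pg_backend_memory_contexts
WHERE name = 'CacheMemoryContext';

If the per-session cache explanation is right, used_bytes should jump after the first planning of the partitioned query and then stay roughly flat on the repeats.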
*Query used for test:*

drop table temp_series_id;
CREATE TEMPORARY TABLE temp_series_id AS
    select series_uid from dicom_series order by random() limit 1;
analyze temp_series_id;
explain (analyze, buffers)
    select * from dicom_sops_100_part where series_uid = (select series_uid from temp_series_id);

Query plan:

                                    QUERY PLAN
-------------------------------------------------------------------------------------------------
 Append  (cost=1.43..423.26 rows=50 width=128) (actual time=2.565..27.216 rows=20 loops=1)
   Buffers: shared hit=50 read=118, local hit=1
   InitPlan 1
     ->  Seq Scan on temp_series_id  (cost=0.00..1.01 rows=1 width=16) (actual time=0.006..0.007 rows=1 loops=1)
           Buffers: local hit=1
   ->  Index Scan using dicom_sops_100_p0_series_uid_idx on dicom_sops_100_p0 dicom_sops_100_part_1  (cost=0.42..8.44 rows=1 width=128) (actual time=0.846..0.846 rows=0 loops=1)
         Index Cond: (series_uid = (InitPlan 1).col1)
   ....
   ->  Index Scan using dicom_sops_100_p49_series_uid_idx on dicom_sops_100_p49 dicom_sops_100_part_50  (cost=0.42..8.44 rows=1 width=128) (actual time=0.302..0.303 rows=0 loops=1)
         Index Cond: (series_uid = (InitPlan 1).col1)
         Buffers: shared hit=1 read=2
 Planning:
   Buffers: shared hit=4180
 Planning Time: 4.941 ms
 Execution Time: 27.682 ms
(159 rows)

Second query on the same session:

                                    QUERY PLAN
-------------------------------------------------------------------------------------------------
 Append  (cost=1.43..423.26 rows=50 width=128) (actual time=9.759..9.770 rows=0 loops=1)
   Buffers: shared hit=100 read=50, local hit=1
   InitPlan 1
     ->  Seq Scan on temp_series_id  (cost=0.00..1.01 rows=1 width=16) (actual time=0.003..0.004 rows=1 loops=1)
           Buffers: local hit=1
   ->  Index Scan using dicom_sops_100_p0_series_uid_idx on dicom_sops_100_p0 dicom_sops_100_part_1  (cost=0.42..8.44 rows=1 width=128) (actual time=0.212..0.213 rows=0 loops=1)
         Index Cond: (series_uid = (InitPlan 1).col1)
   ...
   ->  Index Scan using dicom_sops_100_p49_series_uid_idx on dicom_sops_100_p49 dicom_sops_100_part_50  (cost=0.42..8.44 rows=1 width=128) (actual time=0.236..0.236 rows=0 loops=1)
         Index Cond: (series_uid = (InitPlan 1).col1)
         Buffers: shared hit=2 read=1
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.604 ms
 Execution Time: 10.011 ms
(159 rows)
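To see which relations those 4,180 planning buffers actually land on, here is a sketch of the pg_statio_all_tables before/after diff David suggested below (assuming an otherwise idle instance, since the counters include all sessions and are only flushed at end of transaction; creating the temp table itself also adds a little catalog activity):

-- snapshot in the fresh session, before the first explain
CREATE TEMPORARY TABLE statio_before AS
SELECT relid, schemaname, relname,
       coalesce(heap_blks_read, 0) + coalesce(heap_blks_hit, 0) +
       coalesce(idx_blks_read, 0)  + coalesce(idx_blks_hit, 0) AS blks
FROM pg_statio_all_tables;

-- ... run the explain (analyze, buffers) query here ...

-- then diff: which tables (including pg_catalog ones) were accessed
SELECT s.schemaname, s.relname,
       coalesce(s.heap_blks_read, 0) + coalesce(s.heap_blks_hit, 0) +
       coalesce(s.idx_blks_read, 0)  + coalesce(s.idx_blks_hit, 0) - b.blks AS blks_accessed
FROM pg_statio_all_tables s
JOIN statio_before b USING (relid)
WHERE coalesce(s.heap_blks_read, 0) + coalesce(s.heap_blks_hit, 0) +
      coalesce(s.idx_blks_read, 0)  + coalesce(s.idx_blks_hit, 0) > b.blks
ORDER BY blks_accessed DESC;

I would expect most of the difference to show up on pg_statistic, pg_attribute, pg_class, and pg_index (and their indexes), given 100 partitions with per-partition statistics, but the diff should make that visible.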
On Thu, Jan 16, 2025 at 9:56 AM bruno vieira da silva <brunogi...@gmail.com> wrote:
> Hello, Thanks David.
>
> This is a pg test deployment. Anyways, I did a vacuum full on the db, and the
> number of buffers read increased a bit.
>
> On Wed, Jan 15, 2025 at 3:01 PM David Rowley <dgrowle...@gmail.com> wrote:
>
>> On Thu, 16 Jan 2025 at 07:29, bruno vieira da silva
>> <brunogi...@gmail.com> wrote:
>> > On pg 17 now we have better visibility on the I/O required during query
>> > planning.
>> > So, as part of ongoing design work for table partitioning, I was
>> > analyzing the performance implications of having more or fewer partitions.
>> > In one of my tests, on a table with 200 partitions, explain showed
>> > a large amount of buffers read during planning: around 12k buffers.
>>
>> That's a suspiciously high number of buffers.
>>
>> > I observed that query planning seems to have a caching mechanism, as
>> > subsequent similar queries require only a fraction of the buffers read
>> > during query planning.
>> > However, this "caching" seems to be per session: if I end the client
>> > session and reconnect, the same query execution will again require reading
>> > 12k buffers for query planning.
>> >
>> > Does pg have any mechanism to mitigate this issue (new sessions needing
>> > to read a large amount of buffers for query planning)? Or should I
>> > mitigate it by using connection pooling?
>> > How is this caching done? Is there a way to have visibility on its
>> > usage? Where is it stored?
>>
>> The caching is for relation meta-data and for various catalogue data.
>> This is stored in local session hash tables. The caching is done
>> lazily the first time something is looked up after the session starts.
>> If you're doing very little work before ending the session, then
>> you'll pay this overhead much more often than you would if you were to
>> do more work in each session. A connection pooler would help you do
>> that, otherwise it would need to be a redesign of how you're
>> connecting to Postgres from your application.
>>
>> There's no easy way from EXPLAIN to see which tables or catalogue
>> tables the IO is occurring on, however, you might want to try looking
>> at pg_statio_all_tables directly before and after the query that's
>> causing the 12k buffer accesses and then look at what's changed.
>>
>> I suspect if you're accessing 12k buffers to run EXPLAIN that you have
>> some auto-vacuum starvation issues. Is auto-vacuum enabled and
>> running? If you look at pg_stat_activity, do you see autovacuum
>> running? It's possible that it's running but not configured to run
>> quickly enough to keep up with demand. Alternatively, it may be
>> keeping up now, but at some point in the past, it might not have been
>> and you have some bloat either in an index or in a catalogue table as
>> a result.
>>
>> David
>
>
> --
> Bruno Vieira da Silva

-- 
Bruno Vieira da Silva