Hello Hackers,

I have run some quick tests with postgres_fdw.use_cursor = true/false.

The test table 't1' has 20,000,000 rows (~2.8 GB in size).

The test query is `EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE
a > 1000;`

The total execution times are:

** `postgres_fdw.use_cursor = true` --> 20090.782 ms
** `postgres_fdw.use_cursor = false` --> 24103.994 ms

If you wish to run more advanced or complex test queries, feel free to do
so; a couple of possible examples are sketched below.
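
For example, queries along these lines (just untested suggestions against the
same schema, not something I have measured) would also exercise aggregate and
join pushdown:

EXPLAIN (ANALYZE, BUFFERS)
SELECT a, count(*), max(d) FROM t1fdw WHERE a BETWEEN 15 AND 25 GROUP BY a;

EXPLAIN (ANALYZE, BUFFERS)
SELECT f1.a, f2.c FROM t1fdw f1 JOIN t1fdw f2 ON f1.b = f2.b WHERE f2.a > 25;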

All test scenarios and their output are listed below:

--
-- Restart and flush the OS caches
$ pg_ctl -D $PGDATA stop && sudo sync && \
    echo 3 | sudo tee /proc/sys/vm/drop_caches && \
    rm $PGDATA/log/*.log && pg_ctl -D $PGDATA start
-- Data prep stage
DROP DATABASE IF EXISTS testdb;
CREATE DATABASE testdb;
\c testdb;
DROP TABLE IF EXISTS t1 CASCADE;
CREATE UNLOGGED TABLE t1 (a int, b int, c text, d timestamp);

-- Insert test data
INSERT INTO t1 SELECT 10 + mod(i, 30), i, md5(i::text) ||
md5((i+1000000)::text) || md5((i+2000000)::text), '2025-01-01
10:00:00'::timestamp + (random() * 31536000) * INTERVAL '1 second' FROM
generate_series(1, 20000000) i;

-- Table maintenance
ALTER TABLE t1 SET LOGGED;
VACUUM (ANALYZE, FULL) t1;

-- FDW prep stage
DROP EXTENSION IF EXISTS postgres_fdw CASCADE;
CREATE EXTENSION postgres_fdw;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS
(host '127.0.0.1', port '5432', dbname 'testdb');
CREATE USER MAPPING FOR postgres SERVER foreign_server OPTIONS (user
'postgres');
CREATE FOREIGN TABLE t1fdw (a int, b int, c text, d timestamp) SERVER
foreign_server OPTIONS (table_name 't1');

-- The restart-and-flush-caches step above is applied before each stage is executed.
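
-- The remote-side plans and statements in the log excerpts below come from
-- auto_explain together with statement logging and verbose error output on the
-- server (plus a customized log_line_prefix). A minimal postgresql.conf sketch
-- along the following lines is enough to get this kind of log output; the exact
-- values here are only a suggestion, adjust to taste:

# postgresql.conf (minimal logging sketch; adjust as needed)
shared_preload_libraries = 'auto_explain'   # loading auto_explain needs a server restart
auto_explain.log_min_duration = 0           # log the plan of every completed statement
auto_explain.log_analyze = on               # include actual times and row counts
log_statement = 'all'                       # log the DECLARE / FETCH / CLOSE traffic from postgres_fdw
log_error_verbosity = verbose               # include the SQLSTATE (00000) and LOCATION lines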

--
-- * Local t1 table EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1 WHERE a > 1000;
                                                         QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=0.00..467803.85 rows=1 width=105) (actual
time=23260.306..23261.591 rows=0.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared read=363637
   I/O Timings: shared read=64590.910
   ->  Parallel Seq Scan on t1  (cost=0.00..467803.85 rows=1 width=105)
(actual time=23242.279..23242.280 rows=0.00 loops=3)
         Filter: (a > 1000)
         Rows Removed by Filter: 6666667
         Buffers: shared read=363637
         I/O Timings: shared read=64590.910
 Planning:
   Buffers: shared hit=54 read=14 dirtied=1
   I/O Timings: shared read=23.281
 Planning Time: 38.734 ms
 Execution Time: 23269.677 ms
(15 rows)

Time: 23347.716 ms (00:23.348)

--
-- * use_cursor = true (Default) EXPLAIN results
--

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
                                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual
time=20074.746..20074.796 rows=0.00 loops=1)
 Planning:
   Buffers: shared hit=33 read=17 dirtied=1
   I/O Timings: shared read=10.696
 Planning Time: 43.852 ms
 Execution Time: 20090.782 ms
(6 rows)

Time: 20169.081 ms (00:20.169)

--> From Log Files
[153045]: line=11 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153045]: line=12 sid=6923fba6.255d5 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query,
postgres.c:1078
[153045]: line=13 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: execute
<unnamed>: DECLARE c1 CURSOR FOR
        SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
[153045]: line=14 sid=6923fba6.255d5 tag=DECLARE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
 exec_execute_message, postgres.c:2245
[153045]: line=15 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: FETCH
100 FROM c1
[153045]: line=16 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query,
postgres.c:1078
[153044]: line=5 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOG:  00000: duration: 20074.799 ms  plan:
        Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;
        Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40)
(actual time=20074.746..20074.796 rows=0.00 loops=1)
[153044]: line=6 sid=6923fba4.255d4 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153045]: line=17 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: CLOSE
c1
[153045]: line=18 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query,
postgres.c:1078
[153045]: line=19 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: duration:
20057.543 ms  plan:
        Query Text: DECLARE c1 CURSOR FOR
        SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
        Seq Scan on t1  (cost=0.00..613637.45 rows=1 width=105) (actual
time=20057.541..20057.541 rows=0.00 loops=1)
          Filter: (a > 1000)
          Rows Removed by Filter: 20000000
[153045]: line=20 sid=6923fba6.255d5 tag=CLOSE CURSOR usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:
 explain_ExecutorEnd, auto_explain.c:437
[153045]: line=21 sid=6923fba6.255d5 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: COMMIT
TRANSACTION

--
-- * use_cursor = false EXPLAIN results
--

testdb=# SET postgres_fdw.use_cursor = false;

testdb=# EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw WHERE a > 1000;
                                                     QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40) (actual
time=24080.945..24080.956 rows=0.00 loops=1)
 Planning:
   Buffers: shared hit=33 read=17
   I/O Timings: shared read=30.074
 Planning Time: 53.678 ms
 Execution Time: 24103.994 ms
(6 rows)

Time: 24230.548 ms (00:24.231)

--> From Log Files
[153121]: line=11 sid=6923fc16.25621 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: START
TRANSACTION ISOLATION LEVEL REPEATABLE READ
[153121]: line=12 sid=6923fc16.25621 tag=idle usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query,
postgres.c:1078
[153121]: line=13 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG:  00000: execute <unnamed>: SELECT
a, b, c FROM public.t1 WHERE ((a > 1000))
[153121]: line=14 sid=6923fc16.25621 tag=SELECT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION:  exec_execute_message,
postgres.c:2245
[153113]: line=7 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOG:  00000: duration: 24080.958 ms  plan:
        Query Text: EXPLAIN (ANALYZE, BUFFERS) SELECT a,b,c FROM t1fdw
WHERE a > 1000;
        Foreign Scan on t1fdw  (cost=100.00..215.67 rows=427 width=40)
(actual time=24080.945..24080.956 rows=0.00 loops=1)
[153113]: line=8 sid=6923fc0c.25619 tag=EXPLAIN usr=postgres db=testdb
app=psql client=[local]: LOCATION:  explain_ExecutorEnd, auto_explain.c:437
[153121]: line=15 sid=6923fc16.25621 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOG:  00000: statement: COMMIT
TRANSACTION
[153121]: line=16 sid=6923fc16.25621 tag=idle in transaction usr=postgres
db=testdb app=postgres_fdw client=127.0.0.1: LOCATION:  exec_simple_query,
postgres.c:1078
[153121]: line=17 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOG:  00000: duration: 24059.372 ms
 plan:
        Query Text: SELECT a, b, c FROM public.t1 WHERE ((a > 1000))
        Gather  (cost=0.00..467803.85 rows=1 width=105) (actual
time=24058.076..24059.367 rows=0.00 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          ->  Parallel Seq Scan on t1  (cost=0.00..467803.85 rows=1
width=105) (actual time=24051.406..24051.407 rows=0.00 loops=3)
                Filter: (a > 1000)
                Rows Removed by Filter: 6666667
[153121]: line=18 sid=6923fc16.25621 tag=COMMIT usr=postgres db=testdb
app=postgres_fdw client=127.0.0.1: LOCATION:  explain_ExecutorEnd,
auto_explain.c:437

---

Kenan YILMAZ



On Tue, 25 Nov 2025 at 17:01, Rafia Sabih <[email protected]> wrote:

>
>
> On Fri, 10 Oct 2025 at 22:02, Robert Haas <[email protected]> wrote:
> >
> > On Mon, Sep 29, 2025 at 10:51 AM Rafia Sabih <[email protected]>
> wrote:
> > > I am back at this work with a rebased and revised patch. The new
> version is rebased and has a change in approach.
> > > Whenever we are using non-cursor mode, for the first cursor we are
> always saving the tuples
> > > in the tuplestore, this is because we do not have any means to know
> beforehand how many cursors are required for the query.
> >
> > This might have the advantage of being simpler, but it's definitely
> > worse. If we're only fetching one result set, which will be common,
> > we'll buffer the whole thing in a tuplestore where that could
> > otherwise be avoided. Maybe it's still best to go with this approach;
> > not sure.
> >
> > > And when we switch to the next query then we do not have a way to
> fetch the tuples for the previous query.
> > > So, the tuples retrieved earlier for the first query were lost if not
> saved.
> > > I would highly appreciate your time and feedback for this.
> >
> > My suggestions are to work on the following areas:
> >
> > 1. Automated testing. The patch has no regression tests, and won't get
> > committed without those.
> >
> > 2. Manual testing. How does the performance with this new option
> > compare to the existing method? The answer might be different for
> > small result sets that fit in memory and large ones that spill to
> > disk, and will certainly also depend on how productive parallel query
> > can be on the remote side; but I think you want to do and post on this
> > list some measurements showing the best and worst case for the patch.
> >
> > 3. Patch clean-up. There are plenty of typos and whitespace-only
> > changes in the patch. It's best to clean those up. Running pgindent is
> > a good idea too. Some places could use comments.
> >
> > --
> > Robert Haas
> > EDB: http://www.enterprisedb.com
>
> Thank you Robert, for reviewing this patch. On going through the patch
> more, I realised this was not equipped to handle the cases when there are
> more than two active cursors. So to accommodate such a case, I now modified
> the new struct for saving the previous query to a list of such structs.
> Also, it turns out we need not to save the tuples in case this is an active
> cursor, so we only populate the associated tuplestore only when we need to
> create a new cursor when the old cursor is not completely done.
>
> In this version I have added the regression tests as well. I wanted to
> test this patch for all the cases of postgres_fdw, the only way I could
> figure out how to do this was to test the select statements with the new
> GUC.
>
> I also did some tests for performance. I used the contrib_regression
> database and populated the table "S1"."T1" with more tuples to understand
> the impact of patch on higher scale. I also used auto_explain to get the
> foreign plans.
>
> contrib_regression=# select count(*) from ft1;
> 2025-11-14 14:40:35.825 CET [44338] LOG:  duration: 61.336 ms  plan:
> Query Text: select count(*) from ft1;
> Foreign Scan  (cost=102.50..123.72 rows=1 width=8)
>   Relations: Aggregate on (ft1)
> 2025-11-14 14:40:35.825 CET [44862] LOG:  duration: 60.575 ms  plan:
> Query Text: DECLARE c1 CURSOR FOR
> SELECT count(*) FROM "S 1"."T 1"
> Aggregate  (cost=21888.22..21888.23 rows=1 width=8)
>   ->  Seq Scan on "T 1"  (cost=0.00..19956.98 rows=772498 width=0)
>  count
> --------
>  990821
> (1 row)
>
> Time: 62.728 ms
> contrib_regression=# SET postgres_fdw.use_cursor = false;
> SET
> Time: 1.515 ms
> contrib_regression=# select count(*) from ft1;
> 2025-11-14 14:40:46.260 CET [44862] LOG:  duration: 21.875 ms  plan:
> Query Text: SELECT count(*) FROM "S 1"."T 1"
> Finalize Aggregate  (cost=17255.64..17255.65 rows=1 width=8)
>   ->  Gather  (cost=17255.43..17255.64 rows=2 width=8)
>         Workers Planned: 2
>         ->  Partial Aggregate  (cost=16255.43..16255.44 rows=1 width=8)
>               ->  Parallel Seq Scan on "T 1"  (cost=0.00..15450.74
> rows=321874 width=0)
> 2025-11-14 14:40:46.260 CET [44338] LOG:  duration: 22.623 ms  plan:
> Query Text: select count(*) from ft1;
> Foreign Scan  (cost=102.50..123.72 rows=1 width=8)
>   Relations: Aggregate on (ft1)
>  count
> --------
>  990821
> (1 row)
>
> Time: 24.862 ms
>
> So for this query, the advantage is coming from parallel query which was
> otherwise not possible in this scenario.
> To study more performance with this patch, I found another interesting
> query and here are the results,
>
> contrib_regression=# SET postgres_fdw.use_cursor = true;
> SET
> contrib_regression=# explain  (analyse, buffers)  SELECT t1."C 1" FROM "S
> 1"."T 1" t1 left join ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 =
> t1."C 1");2025-11-14 15:57:46.893 CET [1946]
>                                                               QUERY PLAN
>
>
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Hash Left Join  (cost=109013.50..131877.35 rows=772498 width=4) (actual
> time=112311.578..112804.516 rows=990821.00 loops=1)
>    Hash Cond: (t1."C 1" = t3.c1)
>    Buffers: shared hit=12232, temp read=12754 written=12754
>    ->  Seq Scan on "T 1" t1  (cost=0.00..19956.98 rows=772498 width=4)
> (actual time=0.039..48.808 rows=990821.00 loops=1)
>          Buffers: shared hit=12232
>    ->  Hash  (cost=109001.00..109001.00 rows=1000 width=4) (actual
> time=112310.386..112310.387 rows=990821.00 loops=1)
>          Buckets: 262144 (originally 1024)  Batches: 8 (originally 1)
>  Memory Usage: 6408kB
>          Buffers: temp written=2537
>          ->  Nested Loop  (cost=200.43..109001.00 rows=1000 width=4)
> (actual time=0.728..112030.241 rows=990821.00 loops=1)
>                ->  Foreign Scan on ft1 t2  (cost=100.00..331.00 rows=1000
> width=4) (actual time=0.398..710.505 rows=990821.00 loops=1)
>                ->  Foreign Scan on ft2 t3  (cost=100.43..108.66 rows=1
> width=4) (actual time=0.082..0.082 rows=1.00 loops=990821)
>  Planning:
>    Buffers: shared hit=5
>  Planning Time: 2.211 ms
>  Execution Time: 112825.428 ms
> (15 rows)
>
> contrib_regression=# SET postgres_fdw.use_cursor = false;
> SET
> explain  (analyse, buffers)  SELECT t1."C 1" FROM "S 1"."T 1" t1 left join
> ft1 t2 join ft2 t3 on (t2.c1 = t3.c1) on (t3.c1 = t1."C 1");
> QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------------------------------------
>  Hash Left Join (cost=109013.50..131877.35 rows=772498 width=4) (actual
> time=261.416..354.520 rows=990821.00 loops=1)
>   Hash Cond: (t1."C 1" = t3.c1)
>   Buffers: shared hit=12232, temp written=2660
>   -> Seq Scan on "T 1" t1 (cost=0.00..19956.98 rows=772498 width=4)
> (actual time=0.021..35.531 rows=990821.00 loops=1)
>      Buffers: shared hit=12232
>   -> Hash (cost=109001.00..109001.00 rows=1000 width=4) (actual
> time=261.381..261.383 rows=100.00 loops=1)
>      Buckets: 1024 Batches: 1 Memory Usage: 12kB
>      Buffers: temp written=2660
>      -> Nested Loop (cost=200.43..109001.00 rows=1000 width=4) (actual
> time=255.563..261.356 rows=100.00 loops=1)
>         Buffers: temp written=2660
>         -> Foreign Scan on ft1 t2 (cost=100.00..331.00 rows=1000 width=4)
> (actual time=0.433..0.443 rows=100.00 loops=1)
>         -> Foreign Scan on ft2 t3 (cost=100.43..108.66 rows=1 width=4)
> (actual time=2.609..2.609 rows=1.00 loops=100)
>            Buffers: temp written=2660
>  Planning:
>   Buffers: shared hit=5
>  Planning Time: 2.284 ms
>  Execution Time: 389.358 ms
> (17 rows)
>
> So even in the case without a parallel plan, it is performing
> significantly better. I investigated a bit more to find out why the query
> was so slow with the cursors,
> and came to understand that it is repeatedly abandoning and recreating the
> cursor via the code path of postgresReScanForeignScan.
> So, it looks like cursor management itself costs this much more time.
>
> To understand the impact of the patch in case of tuples spilled to disk, I
> tried following example,
> contrib_regression=# SET postgres_fdw.use_cursor = true;
> SET
> contrib_regression=# set enable_hashjoin = off;
> SET
> contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b
> where a.c1 = b.c1 order by a.c2;
>                                                                QUERY PLAN
>
>
> ----------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=831.49..833.99 rows=1000 width=94) (actual
> time=4537.437..4598.483 rows=990821.00 loops=1)
>    Sort Key: a.c2
>    Sort Method: external merge  Disk: 137768kB
>    Buffers: temp read=83156 written=83253
>    ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual
> time=3748.488..4090.547 rows=990821.00 loops=1)
>          Merge Cond: (a.c1 = b.c1)
>          Buffers: temp read=48725 written=48779
>          ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual
> time=1818.521..1865.792 rows=990821.00 loops=1)
>                Sort Key: a.c1
>                Sort Method: external merge  Disk: 75664kB
>                Buffers: temp read=18910 written=18937
>                ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000
> width=47) (actual time=1.302..1568.640 rows=990821.00 loops=1)
>          ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual
> time=1929.955..1981.104 rows=990821.00 loops=1)
>                Sort Key: b.c1
>                Sort Method: external sort  Disk: 79520kB
>                Buffers: temp read=29815 written=29842
>                ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000
> width=47) (actual time=0.528..1553.249 rows=990821.00 loops=1)
>  Planning Time: 0.479 ms
>  Execution Time: 4661.872 ms
> (19 rows)
>
> contrib_regression=# SET postgres_fdw.use_cursor = false;
> SET
> contrib_regression=# explain (analyse, buffers)select * from ft1 a, ft1 b
> where a.c1 = b.c1 order by a.c2;
>                                                               QUERY PLAN
>
>
> ---------------------------------------------------------------------------------------------------------------------------------------
>  Sort  (cost=831.49..833.99 rows=1000 width=94) (actual
> time=3376.385..3435.406 rows=990821.00 loops=1)
>    Sort Key: a.c2
>    Sort Method: external merge  Disk: 137768kB
>    Buffers: temp read=83156 written=83253
>    ->  Merge Join  (cost=761.66..781.66 rows=1000 width=94) (actual
> time=2565.517..2916.814 rows=990821.00 loops=1)
>          Merge Cond: (a.c1 = b.c1)
>          Buffers: temp read=48725 written=48779
>          ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual
> time=1249.517..1300.132 rows=990821.00 loops=1)
>                Sort Key: a.c1
>                Sort Method: external merge  Disk: 75664kB
>                Buffers: temp read=18910 written=18937
>                ->  Foreign Scan on ft1 a  (cost=100.00..331.00 rows=1000
> width=47) (actual time=1.651..980.740 rows=990821.00 loops=1)
>          ->  Sort  (cost=380.83..383.33 rows=1000 width=47) (actual
> time=1315.990..1369.576 rows=990821.00 loops=1)
>                Sort Key: b.c1
>                Sort Method: external sort  Disk: 79520kB
>                Buffers: temp read=29815 written=29842
>                ->  Foreign Scan on ft1 b  (cost=100.00..331.00 rows=1000
> width=47) (actual time=0.426..970.728 rows=990821.00 loops=1)
>  Planning Time: 0.491 ms
>  Execution Time: 3527.457 ms
> (19 rows)
>
> So it doesn't hurt in this case either. But that is because not the
> tuplestore which is added in this patch is spilling to disk here.
> I am working on finding a case when there are two or more active cursors
> and then when storing the tuples of one cursor,
> they are spilled onto the disk. That might give a picture of the worst
> case scenario of this patch.
>
> Also, thanks to Kenan, a fellow hacker who finds this work interesting and
> offered to do some performance analysis for this patch,
> maybe he can also post more results here.
>
> --
> Regards,
> Rafia Sabih
> CYBERTEC PostgreSQL International GmbH
>
