Hello list,

What would be the best way of setting up a query that uses lots of memory, in 
order to study the effects of resource usage parameters on the performance of 
the query?

I thought that a query on a large table involving JOINs would be one way.  The
backend initially used 1.7 gigs of memory, but it has only used a couple dozen
megs since I restarted PostgreSQL (which I did because the checkpointer process
had 2.4 gigs of memory in use).  (The table consumes nearly 1.3 gigs of disk
space, and the query returns a count of one billion because the join produces
one billion rows.)
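
In case it helps anyone reproduce the observation: the PID of the backend
serving a given session can be obtained from within that session with

SELECT pg_backend_pid();

and then watched from another shell with top or ps; the checkpointer's PID can
be taken straight from the process list.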

The query I was using is this one:

SELECT count(*) FROM pgbench_accounts a JOIN pgbench_accounts b ON (a.abalance = b.abalance);

which I ran after initializing the table with pgbench -i -s 100 and populating
the abalance field with

UPDATE pgbench_accounts SET abalance = aid % 100000 + 1;
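
At scale factor 100 that is 10 million rows, so each of the 100000 distinct
abalance values occurs exactly 100 times, and the self-join therefore yields
about 100000 * 100 * 100 = one billion rows, which matches the actual row count
in the plan below.  A quick sanity check of the distribution, in case it is
useful:

SELECT count(DISTINCT abalance) AS distinct_values,
       count(*) / count(DISTINCT abalance) AS rows_per_value
FROM pgbench_accounts;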

The query plan reads,

postgres=# EXPLAIN ANALYZE SELECT count(*) FROM pgbench_accounts a JOIN pgbench_accounts b ON (a.abalance = b.abalance);
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=22479359.09..22479359.10 rows=1 width=0) (actual time=787769.456..787769.456 rows=1 loops=1)
   ->  Merge Join  (cost=3126660.67..19725402.17 rows=1101582767 width=0) (actual time=62254.349..621119.691 rows=1000000000 loops=1)
         Merge Cond: (a.abalance = b.abalance)
         ->  Sort  (cost=1563330.33..1588330.33 rows=10000000 width=4) (actual time=27706.051..42305.488 rows=10000000 loops=1)
               Sort Key: a.abalance
               Sort Method: external merge  Disk: 136624kB
               ->  Seq Scan on pgbench_accounts a  (cost=0.00..263935.00 rows=10000000 width=4) (actual time=0.260..11592.979 rows=10000000 loops=1)
         ->  Materialize  (cost=1563330.33..1613330.33 rows=10000000 width=4) (actual time=34548.224..209123.713 rows=999999901 loops=1)
               ->  Sort  (cost=1563330.33..1588330.33 rows=10000000 width=4) (actual time=34548.202..43988.283 rows=10000000 loops=1)
                     Sort Key: b.abalance
                     Sort Method: external merge  Disk: 136624kB
                     ->  Seq Scan on pgbench_accounts b  (cost=0.00..263935.00 rows=10000000 width=4) (actual time=0.017..16807.894 rows=10000000 loops=1)
 Total runtime: 787836.841 ms
(13 rows)

The point of the SELECT count(*) (as opposed to something like SELECT a.*) is,
of course, to make sure we are not measuring the memory consumed by the client
while it receives a billion result rows.

There are sorts involved, which I heard may use working memory (work_mem) but
can also "spill to disk", as someone put it; indeed, the plan above shows
"Sort Method: external merge  Disk", so the sorts apparently spilled rather
than staying in memory.  So how do I craft a query that actually does use lots
of memory?
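
One idea I have been toying with (just a sketch, not something I have verified)
is to give the session enough work_mem that the two sorts can stay in memory
instead of spilling, and then re-run the query:

SET work_mem = '1GB';  -- applies per sort operation; the on-disk sorts above
                       -- were ~134 MB each, but the in-memory representation
                       -- needs considerably more
EXPLAIN ANALYZE
SELECT count(*) FROM pgbench_accounts a JOIN pgbench_accounts b ON (a.abalance = b.abalance);

If that works, the plan should report "Sort Method: quicksort  Memory: ..."
instead of the external merge, and the backend should hold correspondingly more
memory while the query runs.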

Cheers,
Holger Friedrich

