Hi Nazir,

On 07/03/2024 12:19, Nazir Bilal Yavuz wrote:
On Wed, 6 Mar 2024 at 18:23, Cédric Villemain
<cedric.villem...@abcsql.com> wrote:
The behavior is 100% OK, and in fact it might be a bad idea to prefetch
block by block, as the result is just to put more pressure on a system
that is already under pressure.

Though there are use cases and it's nice to be able to do that too at
this per page level.
Yes, I do not know which one is more important: cache more blocks but
create more pressure, or create less pressure but cache fewer blocks.
Also, pg_prewarm is designed to be run at startup, so I guess there
will not be much pressure.

autoprewarm is designed for that purpose, but pg_prewarm is free to be used whenever needed.

About [1], it's a very old statement about resources. And Linux manages
part of the problem for us here, I think [2]:

```
/*
 * Chunk the readahead into 2 megabyte units, so that we don't pin too much
 * memory at once.
 */
void force_page_cache_ra(....)
```
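To illustrate the point from [2]: a single WILLNEED hint over a whole 1GB segment stays cheap on the memory side, because the kernel walks it in those 2MB chunks. A minimal standalone sketch (C, with a hypothetical segment path):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* hypothetical path to a 1GB relation segment file */
	int		fd = open("base/5/16384", O_RDONLY);
	int		rc;

	if (fd < 0)
		return 1;

	/*
	 * A single hint covering the whole file (len = 0 means "to end of
	 * file").  The kernel chunks the resulting readahead into 2MB units
	 * itself, so this does not pin the whole segment in memory at once.
	 */
	rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
	if (rc != 0)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
	return rc;
}
```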
Thanks for pointing out the actual code. Yes, it looks like the kernel
is already doing that. I would like to do more testing once you move
the vm_relation functions into pgfincore.


I hope to be able to get back to it next week at the latest.


An example: below I'm using vm_relation_cachestat(), which provides the
Linux cachestat output, vm_relation_fadvise() to unload the cache, and
pg_prewarm for the demo.

# clear cache (nr_cache is the number of file system pages in cache,
not postgres blocks):

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
 block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
           0 |       32768 |    65536 |        0
       32768 |       32768 |    65536 |        0
       65536 |       32768 |    65536 |        0
       98304 |       32768 |    65536 |        0
      131072 |        1672 |     3344 |        0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
 pg_prewarm
------------
     132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
 block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
           0 |       32768 |    65536 |      320
       32768 |       32768 |    65536 |        0
       65536 |       32768 |    65536 |        0
       98304 |       32768 |    65536 |        0
      131072 |        1672 |     3344 |      320  <-- segment 1
```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
 block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
           0 |       32768 |    65536 |    65536
       32768 |       32768 |    65536 |    65536
       65536 |       32768 |    65536 |    65536
       98304 |       32768 |    65536 |    65536
      131072 |        1672 |     3344 |     3344
```

The duration difference in the last example is also really significant:
the full relation takes 0.3ms, block by block takes 1550ms!
You might think it's because of generate_series or whatever, but I see
the exact same behavior with pgfincore.
I can compare loading and unloading durations for similar "async" work:
here each call starts from block 0 with a length of 132744 and a range
of 1 block (i.e. posix_fadvise on 8kB at a time).
So they issue exactly the same number of operations, doing DONTNEED or
WILLNEED, but with a distinct duration on the first "load":

```
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```
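If someone wants to reproduce this contrast without pgfincore, a few lines of C against the segment file show the same thing. A sketch only: the path is hypothetical and BLCKSZ is assumed to be the default 8kB:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192				/* assumed PostgreSQL block size */

int
main(void)
{
	/* hypothetical path to a relation segment file */
	int		fd = open("base/5/16384", O_RDONLY);
	struct stat st;
	off_t	off;

	if (fd < 0 || fstat(fd, &st) != 0)
		return 1;

	/* variant 1: a single hint covering the whole file */
	posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

	/*
	 * Variant 2: one hint per 8kB block, like range=1 above.  Same flag and
	 * same total byte range, but one syscall per block instead of one in
	 * total, and the kernel never sees a large range to read ahead at once.
	 */
	for (off = 0; off < st.st_size; off += BLCKSZ)
		posix_fadvise(fd, off, BLCKSZ, POSIX_FADV_WILLNEED);

	return 0;
}
```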
I confirm that there is a time difference between calling pg_prewarm
for the full relation and block by block, but IMO this is expected.
When pg_prewarm is called for the full relation, it does the
initialization part just once, but when it is called block by block it
does the initialization for each call, right?

Not sure what the initialization is here, exactly. In my example with
WILLNEED/DONTNEED there is exactly the same code pattern and syscall
request(s); only the flag is distinct, so initialization costs are
expected to be very similar.
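For reference, on master the prefetch branch of pg_prewarm() boils down to this loop (quoting contrib/pg_prewarm/pg_prewarm.c from memory, trimmed); the per-call setup (opening the relation, etc.) happens once per SQL call, outside the loop:

```c
/* prefetch mode: just hint the OS, one block at a time */
for (block = first_block; block <= last_block; ++block)
{
	CHECK_FOR_INTERRUPTS();
	PrefetchBuffer(rel, forkNumber, block);
	++blocks_done;
}
```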
Sorry, there was a miscommunication. I was talking about pg_prewarm's
initialization, meaning that if pg_prewarm is called block by block
(using generate_series) it will do the initialization block_count
times, while if it is called for the full relation it will do it just
once. But it seems that is not the case; see below.


OK.

I'll try to move forward on getting those vm_relation functions into
pgfincore, so it'll be easier to run similar tests and compare.
Thanks, that will be helpful for the testing.

I ran 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
generate_series(0, 132744)n;' a couple of times consecutively, but I
could not see a time difference between the first run (first load) and
the consecutive runs. Am I doing something wrong?

Maybe the system is overloaded, and by the time you're done prefetching
the tail blocks the head ones have already been dropped, so looping
like that leads to similar durations.
If it's already in cache and not evicted from it, execution time is
stable. This point (in cache or not) is hard to guess right until you
check the status, or until you make sure to clear the cache first.
My bad. I was trying to drop buffers from the postgres cache, not from
the kernel cache. See my results now:

patched | prefetch test

$ create_the_data [3]
$ drop_kernel_cache [4]
$ first_run_full_relation_prefetch [5] -> Time: 11.395 ms
$ second_run_full_relation_prefetch [5] -> Time: 0.887 ms

master | prefetch test

$ create_the_data [3]
$ drop_kernel_cache [4]
$ first_run_full_relation_prefetch [5] -> Time: 3208.944 ms
$ second_run_full_relation_prefetch [5] -> Time: 283.905 ms

I did more perf tests comparing the first and the second run of the
prefetch and found this on master:

first run:

```
- 86.40% generic_fadvise
   - 86.24% force_page_cache_ra
      - 85.99% page_cache_ra_unbounded
         + 37.36% filemap_add_folio
         + 34.14% read_pages
         + 8.31% folio_alloc
         + 4.55% up_read
           0.77% xa_load
```

second run:

```
- 20.64% generic_fadvise
   - 18.64% force_page_cache_ra
      - 17.46% page_cache_ra_unbounded
         + 8.54% xa_load
           2.82% down_read
           2.29% read_pages
           1.45% up_read
```

So, it looks like the difference between the first and the second run
comes from a kernel optimization that skips the readahead if the page
is already in the cache [6]. That said, I do not know the difference
between WILLNEED and DONTNEED, and I do not have enough material to
test it, but I guess it is something similar.
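Residency itself is easy to observe from userspace with mincore(2) (which, IIRC, is also what pgfincore builds on); a minimal sketch, taking a segment file path as argument:

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int		fd;
	struct stat st;
	long	pagesize = sysconf(_SC_PAGESIZE);
	size_t	pages,
			cached = 0,
			i;
	unsigned char *vec;
	void   *map;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	if (fstat(fd, &st) != 0 || st.st_size == 0)
		return 1;

	pages = (st.st_size + pagesize - 1) / pagesize;
	map = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
	vec = malloc(pages);
	if (map == MAP_FAILED || vec == NULL)
		return 1;

	/* bit 0 of vec[i] is set when page i is resident in the page cache */
	if (mincore(map, st.st_size, vec) != 0)
		return 1;
	for (i = 0; i < pages; i++)
		cached += vec[i] & 1;

	printf("%zu of %zu pages cached\n", cached, pages);
	return 0;
}
```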

Patched: clearly, only a small part has been read and put into VM
during the first pass, but still some pages; and the second pass
probably did nothing at all.

Master: apparently it takes around 3.2 seconds to read everything
(which underlines that the patched first pass read few pages). On the
second pass it's already in cache, so it goes fast: you're correct. But
given it still required 283ms, there is something.

You may want to test the status with vm_relation_cachestat() [7] (it's
in a branch, not main or master). It requires Linux 6.5, but it allows
getting information about memory eviction, which is super handy (and
super fast)! It returns:
 - nr_cache: number of cached pages
 - nr_dirty: number of dirty pages
 - nr_writeback: number of pages marked for writeback
 - nr_evicted: number of pages evicted from the cache
 - nr_recently_evicted: number of pages recently evicted from the cache
The kernel documentation describes "recently evicted" as:

```
/*
 * A page is recently evicted if its last eviction was recent enough that its
 * reentry to the cache would indicate that it is actively being used by the
 * system, and that there is memory pressure on the system.
 */
```
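For poking at it outside PostgreSQL: the syscall can be invoked directly. A minimal sketch; note the syscall number below is the x86_64 one, and there may be no glibc wrapper on your system:

```c
#include <fcntl.h>
#include <linux/mman.h>			/* struct cachestat_range, struct cachestat */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cachestat
#define __NR_cachestat 451		/* x86_64, Linux 6.5+ */
#endif

int
main(int argc, char **argv)
{
	int		fd;
	struct cachestat_range range = {0, 0};	/* len = 0: to end of file */
	struct cachestat cs;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;

	/* no glibc wrapper on most systems yet, so go through syscall(2) */
	if (syscall(__NR_cachestat, fd, &range, &cs, 0) != 0)
	{
		perror("cachestat");
		return 1;
	}

	printf("cache=%llu dirty=%llu writeback=%llu evicted=%llu recently_evicted=%llu\n",
		   (unsigned long long) cs.nr_cache,
		   (unsigned long long) cs.nr_dirty,
		   (unsigned long long) cs.nr_writeback,
		   (unsigned long long) cs.nr_evicted,
		   (unsigned long long) cs.nr_recently_evicted);
	return 0;
}
```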

The WILLNEED posix_fadvise flag leads to what used to be called
"prefetch": reading from disk and putting into VM (it's not that
simple, but that's the idea). DONTNEED flushes from VM.

It might be interesting to compare with prewarm called on each block of
the relation; one way to do that with the current patch is to change
the constant:

#define PREWARM_PREFETCH_RANGE    RELSEG_SIZE

RELSEG_SIZE is 131072 by default (which matches segment 1 starting at
block 131072 in the cachestat output above).

Set it to 1 and you'll have prewarm working on all pages, one by one,
which should be similar to the current behavior. In pgfincore I have a
"range" parameter for that purpose, so the end user can adjust it
exactly as desired. I was not looking to change pg_prewarm's function
parameters, but if it's better...
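Roughly, the patched prefetch loop then becomes something like the following. This is a sketch from memory, assuming the range-capable smgrprefetch() from the patch, not the literal patch code:

```c
for (block = first_block; block <= last_block;)
{
	/* clamp the last chunk to the end of the relation */
	BlockNumber nblocks = Min(PREWARM_PREFETCH_RANGE,
							  last_block - block + 1);

	CHECK_FOR_INTERRUPTS();
	/* one prefetch request covering nblocks consecutive blocks */
	smgrprefetch(RelationGetSmgr(rel), forkNumber, block, nblocks);
	block += nblocks;
	blocks_done += nblocks;
}
```

With the constant set to 1 this degenerates to exactly the per-block master behavior.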

I did not test read performance but I am planning to do that soon.


Nice, thank you for the effort!

[1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION
[2] https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L303
[3]
CREATE EXTENSION pg_prewarm;
drop table if exists foo;
create table foo ( id int, c text) with (autovacuum_enabled=false);
insert into foo select i, repeat('a', 1000) from generate_series(1,10000000)i;

[4] echo 3 | sudo tee /proc/sys/vm/drop_caches

[5] select pg_prewarm('foo', 'prefetch', 'main');

[6] https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L232

[7] https://github.com/klando/pgfincore/blob/vm_relation_cachestat/pgfincore--1.3.1--1.4.0.sql#L54

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D


