On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
<[email protected]> wrote:
Hi Bertrand,
Thanks for playing with this!
> Which makes me wonder if using numa_move_pages()/move_pages is the right
> approach. Would be curious to know if you observe the same behavior though.
You are correct - I'm observing identical behaviour; please see attached.
> Forcing the allocation to happen inside a monitoring function is decidedly
> not great.
We would probably need to split this into a separate, new view
within the pg_buffercache extension; that is going to be slow, yet
still provide valid results. In the previous approach,
get_mempolicy() was allocating on first access, but it was slow not only
because it was allocating but also because it was just one syscall per
one address (yikes!). I somehow struggle to imagine how e.g. scanning
(really: allocating) a 128GB buffer cache in the future won't cause issues
- that's something like 16-17 mln (* 2) syscalls to be issued when not using
move_pages(2).
Another thing is that numa_maps(5) won't help us much either (not enough
granularity).
> But maybe we could use get_mempolicy() only on "valid" buffers i.e
> ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
Different perspective: I wanted to use the same approach in the new
pg_shmemallocations_numa, but it won't cut it there. The other idea
that came to my mind is to issue move_pages() from a backend that
has already used all of those pages. That literally means one of the
ideas below:
1. do it from somewhere like checkpointer / bgwriter?
2. add touching the memory on backend startup, always (sic!)
3. or just attempt to read/touch the memory address right before calling
move_pages(). This last option is just two lines:
  if (os_page_ptrs[blk2page + j] == 0)
  {
+     volatile uint64 touch pg_attribute_unused();

      os_page_ptrs[blk2page + j] = (char *) BufHdrGetBlock(bufHdr) +
          (os_page_size * j);
+     touch = *(uint64 *) os_page_ptrs[blk2page + j];
  }
and it seems to work while still issuing far fewer syscalls thanks to
move_pages() across backends, well, at least here.
Frankly speaking, I do not know which path to take with this; maybe
that's good enough?
-J.
postgres=# select numa_zone_id, count(*) from pg_buffercache group by
numa_zone_id;
NOTICE: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
--------------+-------
| 16149
4 | 59
0 | 59
6 | 58
2 | 59
(5 rows)
postgres=# create table xx as select generate_series(1, 1000000);
SELECT 1000000
postgres=# select numa_zone_id, count(*) from pg_buffercache group by
numa_zone_id;
NOTICE: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
--------------+-------
| 14095
4 | 572
0 | 572
6 | 571
2 | 572
-2 | 2
(6 rows)
postgres=# show shared_buffers ;
shared_buffers
----------------
128MB
(1 row)
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
14121
(1 row)
## and now from 14121:
postgres=# select numa_zone_id, count(*) from pg_buffercache group by
numa_zone_id;
NOTICE: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
--------------+-------
| 13439
4 | 46
0 | 48
6 | 47
2 | 47
-2 | 2757
## also, this won't give us detailed addr <-> specific NUMA node info:
postgres@jw-test3:~$ grep --color /dev/zero /proc/14121/numa_maps
7f5dd6004000 interleave:0-7 file=/dev/zero\040(deleted) dirty=6829 mapmax=7
active=0 N0=853 N1=853 N2=855 N3=853 N4=852 N5=854 N6=854 N7=855
kernelpagesize_kB=4