On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <[email protected]> wrote:
> On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:
Hi Andres, thanks for your review!
OK, the first sane version is attached, with the new src/port/pg_numa.c
boilerplate in 0001. I fixed some bugs too; one optimization remains to
be done (see the `static` question later). Docs/tests are still missing.
QQ: I'm still wondering whether there's a better way of exposing the
fact that multiple PG shmem entries point to the same OS page (think of
something hot: PROCLOCK or ProcArray). Wouldn't it make sense (in some
future thread/patch) to expose this info via an additional column
(pg_get_shmem_numa_allocations.shared_pages bool?)? I'm thinking of an
easy way of showing that NUMA auto-balancing could potentially lead to
NUMA TLB shootdowns (not measuring or counting whether that is actually
happening, just identifying it as a hazard in the allocation layout).
Or does that not make sense, since we already have
pg_shmem_allocations.{off|size} and could derive such info from those
(after all, it is for devs?)?
postgres@postgres:1234 : 18843 # select name,off,off+allocated_size,allocated_size from pg_shmem_allocations order by off;
                      name                      |    off    | ?column?  | allocated_size
------------------------------------------------+-----------+-----------+----------------
[..]
 Proc Header                                    | 147114112 | 147114240 |            128
 Proc Array                                     | 147274752 | 147275392 |            640
 KnownAssignedXids                              | 147275392 | 147310848 |          35456
 KnownAssignedXidsValid                         | 147310848 | 147319808 |           8960
 Backend Status Array                           | 147319808 | 147381248 |          61440
postgres@postgres:1234 : 18924 # select * from pg_shmem_numa_allocations where name IN ('Proc Header', 'Proc Array', 'KnownAssignedXids', '..') order by name;
       name        | numa_zone_id | numa_size
-------------------+--------------+-----------
 KnownAssignedXids |            0 |   2097152
 Proc Array        |            0 |   2097152
 Proc Header       |            0 |   2097152
I.e. Proc Array ends and KnownAssignedXids starts right afterwards;
both are hot, yet they sit on the same huge page and the same NUMA node.
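For illustration, the check behind such a shared_pages column would be
trivial arithmetic over pg_shmem_allocations.{off|size}; a minimal
untested sketch (the helper name is made up by me):

/*
 * Two shmem allocations land on the same OS page iff their page-number
 * intervals intersect; os_page_size would be the huge page size
 * (2MB above) when huge_pages=on.
 */
static bool
shmem_ranges_share_page(Size off1, Size size1,
                        Size off2, Size size2, Size os_page_size)
{
    Size    first1 = off1 / os_page_size;
    Size    last1 = (off1 + size1 - 1) / os_page_size;
    Size    first2 = off2 / os_page_size;
    Size    last2 = (off2 + size2 - 1) / os_page_size;

    return first1 <= last2 && first2 <= last1;
}

E.g. Proc Array (ending at 147275392) and KnownAssignedXids (starting
at 147275392) both land in 2MB page number 70, which matches the
pg_shmem_numa_allocations output above.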
> > If there would be agreement that this is the way we want to have it
> > (from the backend and not from checkpointer), here's what's left on
> > the table to be done here:
>
> > a. isn't there something quicker for touching / page-faulting memory ?
>
> If you actually fault in a page the kernel actually has to allocate memory and
> then zero it out. That rather severely limits the throughput...
OK, there were no comments about madvise(MADV_POPULATE_READ), so I'm
sticking to touching pages via pointers.
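For the archives, the MADV_POPULATE_READ alternative would look roughly
like the untested sketch below (Linux 5.14+, hence the compile-time
guard and the fallback to pointer-touching):

#include <sys/mman.h>
#include <unistd.h>

/* Prefault a region so that move_pages(2) stops reporting -2 (ENOENT)
 * for not-yet-present pages. */
static void
prefault_region(void *addr, size_t len)
{
#ifdef MADV_POPULATE_READ
    if (madvise(addr, len, MADV_POPULATE_READ) == 0)
        return;                 /* kernel faulted everything in for us */
#endif
    /* fallback: touch one byte per OS page */
    {
        size_t  pagesz = sysconf(_SC_PAGESIZE);

        for (char *p = (char *) addr; p < (char *) addr + len; p += pagesz)
            (void) *(volatile char *) p;
    }
}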
> > If not then maybe add CHECKS_FOR_INTERRUPTS() there?
>
> Should definitely be there.
Added.
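I.e. the touch loop is now, roughly:

volatile uint64 touch pg_attribute_unused();

for (size_t i = 0; i < os_page_count; i++)
{
    /* faulting in ~shared_buffers worth of pages can take a while */
    CHECK_FOR_INTERRUPTS();
    if (os_page_ptrs[i] != NULL)
        touch = *(volatile uint64 *) os_page_ptrs[i];
}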
> > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
> > help (it probably only works for parent//postmaster).
>
> Yes, needs to be in postmaster.
>
> Does the issue with "new" backends seeing pages as not present exist both with
> and without huge pages?
Please see attached file for more verbose results, but in short it is
like below:
patch(-touchpages)              hugepages=off  INVALID RESULTS (-2)
patch(-touchpages)              hugepages=on   INVALID RESULTS (-2)
patch(touchpages)               hugepages=off  CORRECT RESULT
patch(touchpages)               hugepages=on   CORRECT RESULT
patch(-touchpages)+MAP_POPULATE hugepages=off  INVALID RESULTS (-2)
patch(-touchpages)+MAP_POPULATE hugepages=on   INVALID RESULTS (-2)
IMHO, the only other thing that could work here (while still
page-faulting) is that 5.14+ madvise(MADV_POPULATE_READ). Tests are
welcome; it may well be kernel-version dependent.
BTW: yes, you can "feel" the timing impact of MAP_SHARED|MAP_POPULATE
during startup, but it seems that in our case child backends simply
don't come up with pre-faulted page mappings across fork().
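For reference, the MAP_POPULATE experiment was nothing more than this
in the postmaster's anonymous-shmem mmap() (a sketch of the change;
ptr/allocsize stand in for the surrounding sysv_shmem.c variables):

int     mmap_flags = PG_MMAP_FLAGS;

#ifdef MAP_POPULATE
    /* ask the kernel to prefault the whole segment at map time */
    mmap_flags |= MAP_POPULATE;
#endif
    ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);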
> FWIW, what you posted fails on CI:
> https://cirrus-ci.com/task/5114213770723328
>
> Probably some ifdefs are missing. The sanity-check task configures with
> minimal dependencies, which is why you're seeing this even on linux.
Hopefully fixed; we'll see what cfbot says. I'm flying blind with all
of this CI stuff...
> > b. refactor shared code so that it goes into src/port (but with
> > Linux-only support so far)
Done.
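The surface area in src/port/pg_numa.c is deliberately tiny; roughly
this shape (a sketch, the actual names/signatures in the attached 0001
may differ):

#ifdef USE_LIBNUMA
#include <numa.h>

int
pg_numa_init(void)
{
    /* numa(3): numa_available() must be called first; -1 means no NUMA */
    return numa_available();
}

int
pg_numa_query_pages(unsigned long count, void **pages, int *status)
{
    /* with nodes == NULL, move_pages(2) merely queries page placement */
    return (int) numa_move_pages(0, count, pages, NULL, status, 0);
}
#else
/* non-Linux builds get stubs that simply report "no NUMA support" */
int
pg_numa_init(void)
{
    return -1;
}

int
pg_numa_query_pages(unsigned long count, void **pages, int *status)
{
    return -1;
}
#endif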
> > c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
>
> You mean a specific context instead of CurrentMemoryContext?
Yes, I had doubts earlier, but for now I'm going to leave it as it is.
> > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> > index 91b51142d2e..e3b7554d9e8 100644
> > --- a/.cirrus.tasks.yml
> > +++ b/.cirrus.tasks.yml
> > @@ -436,6 +436,10 @@ task:
> > SANITIZER_FLAGS: -fsanitize=address
> > PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
> >
> > +
> > + # FIXME: use or not the libnuma?
> > + # --with-libnuma \
> > + #
> >   # Normally, the "relation segment" code basically has no coverage in our
> >   # tests, because we (quite reasonably) don't generate tables large
> >   # enough in tests. We've had plenty bugs that we didn't notice due the
>
> I don't see why not.
Fixed.
> > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > new file mode 100644
> > index 00000000000..e5b3d1f7dd2
> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,30 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the function.
> > +DROP FUNCTION pg_buffercache_pages() CASCADE;
>
> Why? I think that's going to cause problems, as the pg_buffercache view
> depends on it, and user views might in turn depend on pg_buffercache. I think
> CASCADE is rarely, if ever, ok to use in an extension script.
... it's just me cutting corners :^), fixed now.
[..]
> > +-- Don't want these to be available to public.
> > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
>
> We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same
> for _numa?
Yup, fixed.
> > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
> > else
> > fctx->record[i].isvalid = false;
> >
> > +#ifdef USE_LIBNUMA
> > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +                       blk2page = (int) i * pages_per_blk;
>
> BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.
Fixed, thanks, I was looking for something like this! Is that +1 in v4 good?
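For clarity, what v4 does now is essentially this (i being the 0-based
loop index):

/* Buffer numbers are 1-based while the fctx loop index i is 0-based */
os_page_ptrs[blk2page + j] = (char *) BufferGetBlock(i + 1) + os_page_size * j;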
> > +                       j = 0;
> > +                       do {
> > +                               /*
> > +                                * Many buffers can point to the same page, but we want to
> > +                                * query just first address.
> > +                                *
> > +                                * In order to get reliable results we also need to touch memory pages
> > +                                * so that inquiry about NUMA zone doesn't return -2.
> > +                                */
> > +                               if(os_page_ptrs[blk2page+j] == 0) {
> > +                                       volatile uint64 touch pg_attribute_unused();
> > +                                       os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
> > +                                       touch = *(uint64 *)os_page_ptrs[blk2page+j];
> > +                               }
> > +                               j++;
> > +                       } while(j < (int)pages_per_blk);
> > +#endif
> > +
> > +
>
> Why is this done before we even have gotten -2 back? Even if we need it, it
> seems like we ought to defer this until necessary.
Not fixed yet: maybe we could even use a `static` flag along the lines
of `has_this_run_earlier` and perform this work only once, the first
time through?
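Something like this untested sketch is what I have in mind (the flag
name is made up):

/* Pay the page-touching cost only once per backend lifetime. */
static bool firstNumaTouch = true;

if (query_numa && firstNumaTouch)
{
    volatile uint64 touch pg_attribute_unused();

    for (size_t i = 0; i < os_page_count; i++)
    {
        CHECK_FOR_INTERRUPTS();
        touch = *(volatile uint64 *) os_page_ptrs[i];
    }
    firstNumaTouch = false;
}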
> > +#ifdef USE_LIBNUMA
> > +       if(query_numa) {
> > +               /* According to numa(3) it is required to initialize library even if that's no-op. */
> > +               if(numa_available() == -1) {
> > +                       pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> > +                       elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");
> > +               } else {
> > +                       /* Amortize the number of pages we need to query about */
> > +                       if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> > +                               elog(ERROR, "failed NUMA pages inquiry status");
> > +                       }
>
> I wonder if we ought to override numa_error() so we can display more useful
> errors.
Another question without an easy answer, as I never hit this error path
from numa_move_pages(); one gets invalid entries in *os_pages_status
instead. BUT: most of this patch only uses libnuma calls that, as far
as libnuma usage goes, cannot fail. One way to trigger libnuma warnings
is e.g. `chmod 700 /sys` (because it's hard to unmount it); most
numactl functionality still works with euid != 0, but `numactl
--hardware` at least gets "libnuma: Warning: Cannot parse distance
information in sysfs: Permission denied" (same story with `numactl -C
678 date`). So unless we start using libnuma much more heavily (not
just for observability), there's arguably no point in that right
now(?). That said, we could just route those messages through a
variadic elog(); I've put in some code, but I have no idea if it works
correctly...
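FWIW the code I put in is roughly the following (numa(3) documents
numa_error() and numa_warn() as weak symbols meant to be overridden by
the program); again, untested at best:

#include "postgres.h"

#include <numa.h>

void
numa_warn(int num, char *fmt,...)
{
    char        buf[1024];
    va_list     ap;

    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    ereport(WARNING,
            (errcode(ERRCODE_SYSTEM_ERROR),
             errmsg_internal("libnuma: WARNING: %s", buf)));
}

void
numa_error(char *where)
{
    ereport(ERROR,
            (errcode(ERRCODE_SYSTEM_ERROR),
             errmsg_internal("libnuma: ERROR: %s", where)));
}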
[..]
> Doing multiple memory allocations while holding an lwlock is probably not a
> great idea, even if the lock normally isn't contended.
[..]
> Why do this in every loop iteration?
Both fixed.
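For the archives, the first fix boils down to this shape inside
ipc/shmem.c, where ShmemIndex lives (a sketch; max_pages/page_ptrs are
placeholders, and the view walks ShmemIndex much like
pg_get_shmem_allocations() does):

HASH_SEQ_STATUS hstat;
ShmemIndexEnt *ent;

/* allocate everything up front ... */
page_ptrs = palloc0(sizeof(void *) * max_pages);
pages_status = palloc(sizeof(int) * max_pages);

/* ... so nothing allocates (or errors out mid-allocation) under the lock */
LWLockAcquire(ShmemIndexLock, LW_SHARED);
hash_seq_init(&hstat, ShmemIndex);
while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
{
    /* compute this segment's OS-page range; no palloc() in here */
}
LWLockRelease(ShmemIndexLock);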
-J.
IMPACT OF pg_numa_touch_mem_if_required() from pg_numa.h: so we are touching
like ~16k * 8kB buffers =~ 128MB (close to pgbench_accounts size):
postgres=# \dt+ pgbench_accounts
                                          List of tables
 Schema |       Name       | Type  |  Owner   | Persistence | Access method |  Size  | Description
--------+------------------+-------+----------+-------------+---------------+--------+-------------
 public | pgbench_accounts | table | postgres | permanent   | heap          | 128 MB |
pgbench -i -s 10 ~~ 128MB size, and "select relfilenode, count(*) from pg_buffercache where relfilenode is not null group by relfilenode order by 2 desc limit 3;" shows:
relfilenode | count
-------------+-------
28345 | 16402 // <-- pgbench_accounts
1249 | 46
2619 | 44
EXPECTED RESULT: at least 16k+ buffers split into multiple NUMA zones
patch(-touchpages) hugepages=off INVALID RESULTS (-2)
postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;'
[..]
done in 2.26 s (drop tables 0.12 s, create tables 0.01 s, client-side generate 1.40 s, vacuum 0.18 s, primary keys 0.54 s).
 huge_pages_status
-------------------
 off
(1 row)
 numa_zone_id | count
--------------+--------
              | 507568
            0 |     98
            2 |     98
           -2 |  16524
patch(-touchpages) hugepages=on INVALID RESULTS (-2)
postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;'
[..]
done in 2.54 s (drop tables 0.20 s, create tables 0.02 s, client-side generate 1.57 s, vacuum 0.21 s, primary keys 0.55 s).
 huge_pages_status
-------------------
 on
(1 row)
 numa_zone_id | count
--------------+--------
              | 507490
            0 |    256
            2 |    256
           -2 |  15774
            1 |    512
patch(touchpages) hugepages=off CORRECT RESULT
postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;'
[..]
done in 2.42 s (drop tables 0.12 s, create tables 0.01 s, client-side generate 1.55 s, vacuum 0.19 s, primary keys 0.54 s).
 huge_pages_status
-------------------
 off
(1 row)
 numa_zone_id | count
--------------+--------
              | 507589
            0 |   8349
            2 |   8350
^^ it's interesting: the split across NUMA nodes differs with vs. without huge pages
patch(touchpages) hugepages=on CORRECT RESULT
postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;' -c '\l+'
[..]
done in 2.21 s (drop tables 0.12 s, create tables 0.01 s, client-side generate 1.41 s, vacuum 0.19 s, primary keys 0.47 s).
 huge_pages_status
-------------------
 on
(1 row)
 numa_zone_id | count
--------------+--------
              | 507602
            3 |   4096
            0 |   4352
            2 |   4096
            1 |   4142
^^ it's interesting: the split across NUMA nodes differs with vs. without huge pages
patch(-touchpages)+MAP_POPULATE hugepages=off INVALID RESULTS (-2)
postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;' -c '\l+'
[..]
done in 2.25 s (drop tables 0.13 s, create tables 0.01 s, client-side generate 1.45 s, vacuum 0.19 s, primary keys 0.48 s).
 huge_pages_status
-------------------
 off
(1 row)
 numa_zone_id | count
--------------+--------
              | 507594
            3 |     98
           -2 |  16498
            1 |     98
(4 rows)
patch(-touchpages)+MAP_POPULATE hugepages=on INVALID RESULTS (-2)
postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;' -c '\l+'
[..]
done in 4.89 s (drop tables 0.12 s, create tables 0.01 s, client-side generate 1.50 s, vacuum 1.32 s, primary keys 1.93 s).
 huge_pages_status
-------------------
 on
(1 row)
 numa_zone_id | count
--------------+--------
              | 507596
            0 |    256
            2 |    256
           -2 |  15616
            1 |    564
(5 rows)
cross-check that MAP_POPULATE was used above:
postgres@jw-test3:~$ strace -ffe mmap numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart 2>&1 | grep MAP_SHARED
mmap(NULL, 27028, PROT_READ, MAP_SHARED, 3, 0) = 0x7f954aaa1000
[pid 15686] mmap(NULL, 27028, PROT_READ, MAP_SHARED, 3, 0) = 0x7fecd3d46000
[pid 15686] mmap(NULL, 4454350848, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS|MAP_POPULATE|MAP_HUGETLB, -1, 0) = 0x7febc7200000
postgres@jw-test3:~$ uname -r
6.10.11+bpo-cloud-amd64
v4-0003-Add-pg_shmem_numa_allocations.patch
Description: Binary data
v4-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch
Description: Binary data
v4-0001-Add-optional-dependency-to-libnuma-for-basic-NUMA.patch
Description: Binary data
