Hi,
On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote:
> On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
> <[email protected]> wrote:
>
> Hi,
>
> Thank you very much for the review! I'm answering both reviews in
> one go and the result is attached as v12; it all should be solved
> now:
Thanks for v12!
I'll review 0001 and 0003 later, but want to share what I've done for 0002.
I did prepare a patch file (attached as .txt so as not to disturb the cfbot)
to apply on top of v11 0002 (I rebased it a bit so that it now applies on top
of v12 0002).
The main changes are:
=== 1
Some doc wording changes.
=== 2
Some comments, pgindent and friends changes.
=== 3
relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
- pinning_backends int4, numa_zone_id int4);
+ pinning_backends int4, zone_id int4);
I changed numa_zone_id to zone_id, as we are already in "numa"-related
functions and/or views.
=== 4
- CHECK_FOR_INTERRUPTS();
}
+
+ CHECK_FOR_INTERRUPTS();
I think it's better to put the CFI outside of the if condition, so that it is
reached for every page.
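Condensed from the attached patch, the loop body now looks like this, so the
CFI runs on every iteration (touched page or not):

    for (size_t j = 0; j < pages_per_blk; j++)
    {
        if (os_page_ptrs[blk2page + j] == 0)
        {
            /* prepare the pointer, touch memory on first use */
        }

        CHECK_FOR_INTERRUPTS();
    }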
=== 5
-pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
for clarity.
=== 6
-get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
for clarity.
=== 7
- int os_page_count = 0;
+ uint64 os_page_count = 0;
I think that's better: the count is derived from NBuffers * BLCKSZ, which can
exceed the range of int with a large shared_buffers.
=== 8
But then, here, we need to change its format to %lu and cast to (unsigned
long) to make the CI CompilerWarnings task happy.
That's fine because we don't provide NUMA support on 32-bit platforms anyway.
- elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
- os_page_count, os_page_size, pages_per_blk);
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
=== 9
-static bool firstUseInBackend = true;
+static bool firstNumaTouch = true;
Looks better to me, but I'm still not 100% convinced by the name.
=== 10
static BufferCachePagesContext *
-pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
as PG_FUNCTION_ARGS is usually used for fmgr-compatible functions, and there
are a lot of examples in the code that use "FunctionCallInfo" for
non-fmgr-compatible functions.
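For context, fmgr.h defines PG_FUNCTION_ARGS as just "FunctionCallInfo
fcinfo", so the signature is unchanged; spelling the type out simply makes it
clear the helper is not an fmgr entry point:

    /* fmgr entry point: the macro expands to "FunctionCallInfo fcinfo" */
    Datum
    pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
    {
        ...
    }

    /* internal helper, not called through fmgr */
    static BufferCachePagesContext *
    pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)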
=== 11
I don't like the fact that we iterate 2 times over NBuffers in
pg_buffercache_numa_pages().
But I'm having a hard time finding a better approach given the fact that
pg_numa_query_pages() needs all the pointers "prepared" before it can be
called...
Those 2 loops are probably the best approach, unless someone has a better idea.
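For reference, roughly what the two passes look like in the patch
(pg_numa_query_pages() wants the whole pointer array in a single call, hence
the split):

    /* pass 1: build tuples and collect one pointer per backing OS page */
    for (i = 0; i < NBuffers; i++)
    {
        pg_buffercache_build_tuple(i, fctx);
        pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
                                         os_page_ptrs);
    }

    /* single kernel round trip for all prepared pointers */
    if (pg_numa_query_pages(0, os_page_count, os_page_ptrs,
                            os_pages_status) == -1)
        elog(ERROR, "failed NUMA pages inquiry");

    /* pass 2: assign each buffer the status of its first OS page */
    for (i = 0; i < NBuffers; i++)
        fctx->record[i].numa_zone_id =
            os_pages_status[(int) (i * pages_per_blk)];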
=== 12
Upthread you asked "Can you please take a look again on this, is this up to the
project standards?"
Was the question about using pg_buffercache_numa_prepare_ptrs() as an inlined
wrapper?
What do you think? The comments, doc and code changes are just proposals and are
fully open to discussion.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
index 52f63aa258c..42a693aa4d4 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -28,7 +28,7 @@ CREATE OR REPLACE VIEW pg_buffercache_numa AS
SELECT P.* FROM pg_buffercache_numa_pages() AS P
(bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
- pinning_backends int4, numa_zone_id int4);
+ pinning_backends int4, zone_id int4);
-- Don't want these to be available to public.
REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index b27add81f0a..c5cfa32fa07 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -67,56 +67,59 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
PG_FUNCTION_INFO_V1(pg_buffercache_evict);
-/*
- * To get reliable results we need to "touch pages" once, see
- * comments nearby pg_buffercache_numa_prepare_ptrs().
- */
-static bool firstUseInBackend = true;
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
/*
- * Helper routine to map Buffers into addresses that can be
- * later consumed by pg_numa_query_pages()
+ * Helper routine to map Buffers into the addresses that are later
+ * consumed by pg_numa_query_pages().
*
- * Many buffers can point to the same page (in case of
- * BLCKSZ < 4kB), but we want to also query just first
- * address.
+ * When database block size (BLCKSZ) is smaller than the OS page size (4kB),
+ * multiple database buffers will map to the same OS memory page. In this case,
+ * we only need to query the NUMA zone for the first memory address of each
+ * unique OS page rather than for every buffer.
*
- * In order to get reliable results we also need to touch
- * memory pages, so that inquiry about NUMA zone doesn't
- * return -2.
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the NUMA zone inquiry doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
*/
static inline void
-pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk, Size os_page_size,
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
+ Size os_page_size,
void **os_page_ptrs)
{
- size_t blk2page = (size_t)(buffer_id * pages_per_blk);
+ size_t blk2page = (size_t) (buffer_id * pages_per_blk);
for (size_t j = 0; j < pages_per_blk; j++)
{
- size_t blk2pageoff = blk2page + j;
+ size_t blk2pageoff = blk2page + j;
+
if (os_page_ptrs[blk2pageoff] == 0)
{
volatile uint64 touch pg_attribute_unused();
- /* NBuffers count start really from 1 */
- os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) + (os_page_size * j);
+ /* Buffer numbers start from 1 */
+ os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) +
+ (os_page_size * j);
- /* We just need to do it only once in backend */
- if (firstUseInBackend)
+ /* Only need to touch memory once per backend process lifetime */
+ if (firstNumaTouch)
pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
- CHECK_FOR_INTERRUPTS();
}
+
+ CHECK_FOR_INTERRUPTS();
}
}
/*
- * Helper routine for pg_buffercache_(numa_)pages.
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
*
- * We need fcinfo here and we pass it here with PG_FUNCTION_ARGS
+ * Sets up the SRF multi-call context and allocates the per-buffer records.
+ * It takes an explicit FunctionCallInfo as it is not called through fmgr.
*/
static BufferCachePagesContext *
-pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
{
BufferCachePagesContext *fctx; /* User function context. */
MemoryContext oldcontext;
@@ -191,64 +194,68 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
}
/*
- * Helper routine for pg_buffercache_(numa_)pages
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Build buffer cache information for a single buffer.
*/
static void
-pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
{
BufferDesc *bufHdr;
uint32 buf_state;
- bufHdr = GetBufferDescriptor(i);
+ bufHdr = GetBufferDescriptor(record_id);
/* Lock each buffer header before inspecting. */
buf_state = LockBufHdr(bufHdr);
- fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
- fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
- fctx->record[i].reltablespace = bufHdr->tag.spcOid;
- fctx->record[i].reldatabase = bufHdr->tag.dbOid;
- fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
- fctx->record[i].blocknum = bufHdr->tag.blockNum;
- fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
- fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+ fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
+ fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+ fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
+ fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
+ fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
+ fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
+ fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+ fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
if (buf_state & BM_DIRTY)
- fctx->record[i].isdirty = true;
+ fctx->record[record_id].isdirty = true;
else
- fctx->record[i].isdirty = false;
+ fctx->record[record_id].isdirty = false;
/*
* Note if the buffer is valid, and has storage created
*/
if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
- fctx->record[i].isvalid = true;
+ fctx->record[record_id].isvalid = true;
else
- fctx->record[i].isvalid = false;
+ fctx->record[record_id].isvalid = false;
- fctx->record[i].numa_zone_id = -1;
+ fctx->record[record_id].numa_zone_id = -1;
UnlockBufHdr(bufHdr, buf_state);
}
/*
- * Helper routine for pg_buffercache_(numa_)pages
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
*/
static Datum
-get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
{
Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
HeapTuple tuple;
- values[0] = Int32GetDatum(fctx->record[i].bufferid);
+ values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
nulls[0] = false;
/*
* Set all fields except the bufferid to null if the buffer is unused or
* not valid.
*/
- if (fctx->record[i].blocknum == InvalidBlockNumber ||
- fctx->record[i].isvalid == false)
+ if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
+ fctx->record[record_id].isvalid == false)
{
nulls[1] = true;
nulls[2] = true;
@@ -266,27 +273,27 @@ get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
}
else
{
- values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
+ values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
nulls[1] = false;
- values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
+ values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
nulls[2] = false;
- values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
+ values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
nulls[3] = false;
- values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
+ values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
nulls[4] = false;
- values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
+ values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
nulls[5] = false;
- values[6] = BoolGetDatum(fctx->record[i].isdirty);
+ values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
nulls[6] = false;
- values[7] = Int16GetDatum(fctx->record[i].usagecount);
+ values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
nulls[7] = false;
/*
* unused for v1.0 callers, but the array is always long enough
*/
- values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
+ values[8] = Int32GetDatum(fctx->record[record_id].pinning_backends);
nulls[8] = false;
- values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+ values[9] = Int32GetDatum(fctx->record[record_id].numa_zone_id);
nulls[9] = false;
}
@@ -295,10 +302,6 @@ get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
return HeapTupleGetDatum(tuple);
}
-/*
- * When updating this routine please sync it with below one:
- * pg_buffercache_numa_pages()
- */
Datum
pg_buffercache_pages(PG_FUNCTION_ARGS)
{
@@ -359,39 +362,46 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
Size os_page_size = 0;
void **os_page_ptrs = NULL;
int *os_pages_status = NULL;
- int os_page_count = 0;
+ uint64 os_page_count = 0;
float pages_per_blk = 0;
funcctx = SRF_FIRSTCALL_INIT();
+
if (pg_numa_init() == -1)
- elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
fctx = pg_buffercache_init_entries(funcctx, fcinfo);
/*
- * This is for gathering some NUMA statistics. We might be using
- * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
- * allocated in various different OS memory pages sizes, so first
- * we need to understand the OS memory page size before calling
- * move_pages()
- */
+ * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+ * while the OS may have different memory page sizes.
+ *
+ * To correctly map between them, we need to:
+ *
+ * - Determine the OS memory page size
+ * - Calculate how many OS pages are used by all buffer blocks
+ * - Calculate how many OS pages are contained within each database block
+ *
+ * This information is needed before calling move_pages() for NUMA
+ * zone inquiry.
+ */
os_page_size = pg_numa_get_pagesize();
os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
pages_per_blk = (float) BLCKSZ / os_page_size;
- elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
- os_page_count, os_page_size, pages_per_blk);
+ elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+ (unsigned long) os_page_count, os_page_size, pages_per_blk);
os_page_ptrs = palloc(sizeof(void *) * os_page_count);
- os_pages_status = palloc(sizeof(int) * os_page_count);
+ os_pages_status = palloc(sizeof(uint64) * os_page_count);
memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
/*
- * If we ever get 0xff back from kernel inquiry, then we probably
- * have bug in our buffers to OS page mapping code here
- */
+ * If we ever get 0xff back from the kernel inquiry, then we probably
+ * have a bug in our buffer to OS page mapping code here
+ */
memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
- if (firstUseInBackend)
+ if (firstNumaTouch)
elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
/*
@@ -405,7 +415,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
for (i = 0; i < NBuffers; i++)
{
pg_buffercache_build_tuple(i, fctx);
- pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs);
+ pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+ os_page_ptrs);
}
if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
@@ -416,10 +427,15 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
int blk2page = (int) i * pages_per_blk;
/*
- * Technically we can get errors too here and pass that to
- * user. Also we could somehow report single DB block spanning
- * more than one NUMA zone, but it should be rare.
- */
+ * Set the NUMA zone ID for this buffer based on the first OS page
+ * it maps to.
+ *
+ * Note: We could check for errors in os_pages_status and report
+ * them. Also, a single DB block might span multiple NUMA zones if
+ * it crosses OS pages on zone boundaries, but we only record the
+ * zone of the first page. This is a simplification but should be
+ * sufficient for most analyses.
+ */
fctx->record[i].numa_zone_id = os_pages_status[blk2page];
}
}
@@ -439,7 +455,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
}
else
{
- firstUseInBackend = false;
+ firstNumaTouch = false;
SRF_RETURN_DONE(funcctx);
}
}
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 75978a6eaed..4b49bb2974a 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -45,8 +45,9 @@
</para>
<para>
- The similiar <function>pg_buffercache_numa_pages()</function> is a slower
- variant of the above, but also can provide NUMA node ID for shared buffer entry.
+ The <function>pg_buffercache_numa_pages()</function> function provides the
+ same information as <function>pg_buffercache_pages()</function>, but is
+ slower because it also provides the <acronym>NUMA</acronym> node ID per
+ shared buffer entry.
The <structname>pg_buffercache_numa</structname> view wraps the function for
convenient use.
</para>
@@ -213,13 +214,14 @@
<title>The <structname>pg_buffercache_numa</structname> View</title>
<para>
- The definitions of the columns exposed are almost identical to the previous
- <structname>pg_buffercache</structname> view, but this one includes one additional
- column numa_zone_id as defined in <xref linkend="pgbuffercache-numa-columns"/>.
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>zone_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
</para>
<table id="pgbuffercache-numa-columns">
- <title><structname>pg_buffercache_numa</structname> Columns</title>
+ <title><structname>pg_buffercache_numa</structname> Extra Column</title>
<tgroup cols="1">
<thead>
<row>
@@ -235,11 +237,12 @@
<tbody>
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>numa_zone_id</structfield> <type>integer</type>
+ <structfield>zone_id</structfield> <type>integer</type>
</para>
<para>
- ID (number) of the NUMA node for this particular buffer. NULL if the shared buffer
- has not been used yet.On systems without NUMA this usually returns 0.
+ <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+ has not been used yet. On systems without <acronym>NUMA</acronym> support
+ this returns 0.
</para></entry>
</row>
@@ -248,16 +251,10 @@
</table>
<para>
- This is clone version of the original pg_buffercache view, however it provides
- additional <structfield>numa_zone_id</structfield> column. Fetching this
- information from OS is costly and might take much longer and querying it is not
- recommended by automated or monitoring systems.
- </para>
-
- <para>
- As NUMA node ID inquiry for each page requires memory pages to be paged-in, first
- execution of this function can take long time especially on systems with bigint
- shared_buffers and without huge_pages enabled.
+ As the <acronym>NUMA</acronym> node ID inquiry for each page requires memory
+ pages to be paged-in, the first execution of this function can take a
+ noticeable amount of time. In all cases (first execution or not), retrieving
+ this information is costly and querying the view at a high frequency is not
+ recommended.
</para>
</sect2>