Hi,

On Fri, Mar 14, 2025 at 11:05:28AM +0100, Jakub Wartak wrote:
> On Thu, Mar 13, 2025 at 3:15 PM Bertrand Drouvot
> <bertranddrouvot...@gmail.com> wrote:
> > Hi,
>
> Thank you very much for the review! I'm answering to both reviews in
> one go and the results is attached v12, seems it all should be solved
> now:
Thanks for v12! I'll review 0001 and 0003 later, but want to share what
I've done for 0002.

I did prepare a patch file (attached as .txt to not disturb the cfbot) to
apply on top of v11 0002 (I just rebased it a bit so that it now applies
on top of v12 0002).

The main changes are:

=== 1

Some doc wording changes.

=== 2

Some comments, pgindent and friends changes.

=== 3

      relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
-     pinning_backends int4, numa_zone_id int4);
+     pinning_backends int4, zone_id int4);

I changed numa_zone_id to zone_id as we are already in "numa" related
functions and/or views.

=== 4

-            CHECK_FOR_INTERRUPTS();
         }
+
+        CHECK_FOR_INTERRUPTS();

I think that it's better to put the CFI outside of the if condition, so
that it can be called for every page.

=== 5

-pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)

for clarity.

=== 6

-get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)

for clarity.

=== 7

-        int         os_page_count = 0;
+        uint64      os_page_count = 0;

I think that's better.

=== 8

But then, here, we need to change its format to %lu and to cast it to
(unsigned long) to make the CI CompilerWarnings happy. That's fine because
we don't provide NUMA support for 32 bits anyway.

-        elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
-             os_page_count, os_page_size, pages_per_blk);
+        elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+             (unsigned long) os_page_count, os_page_size, pages_per_blk);

=== 9

-static bool firstUseInBackend = true;
+static bool firstNumaTouch = true;

Looks better to me but still not 100% convinced by the name.

=== 10

 static BufferCachePagesContext *
-pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)

as PG_FUNCTION_ARGS is usually used for fmgr-compatible functions and
there are a lot of examples in the code that make use of "FunctionCallInfo"
for non fmgr-compatible functions (the first snippet after my signature
shows the fmgr.h definition behind this).

and also:

=== 11

I don't like the fact that we iterate 2 times over NBuffers in
pg_buffercache_numa_pages(). But I'm having a hard time finding a better
approach given the fact that pg_numa_query_pages() needs all the pointers
"prepared" before it can be called... Those 2 loops are probably the best
approach, unless someone has a better idea (the second snippet after my
signature sketches why the pointers must come first).

=== 12

Upthread you asked "Can you please take a look again on this, is this up
to the project standards?" Was the question about using
pg_buffercache_numa_prepare_ptrs() as an inlined wrapper? What do you
think?

The comments, doc and code changes are just proposals and are fully open
to discussion.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
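PS: On === 10, a minimal illustration of why spelling out FunctionCallInfo
is equivalent. The macro definition is the one from src/include/fmgr.h;
the two function names are made up for the example:

    /* fmgr.h defines the macro as nothing more than a named parameter: */
    #define PG_FUNCTION_ARGS	FunctionCallInfo fcinfo

    /* fmgr-callable function: uses the macro, registered with the V1 ABI */
    PG_FUNCTION_INFO_V1(my_fmgr_function);
    Datum		my_fmgr_function(PG_FUNCTION_ARGS);

    /*
     * Internal helper: takes the same parameter, but spells the type out
     * since it is not itself callable through fmgr.
     */
    static void my_helper(FunctionCallInfo fcinfo);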
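PPS: On === 11, a standalone sketch (plain C against libnuma's move_pages(2)
wrapper, not PostgreSQL code; the sizes and names are made up) of why the
two loops seem unavoidable: move_pages() is a single batched syscall, so
every pointer has to be prepared (and touched) before it runs, and the
per-page statuses can only be consumed afterwards:

    /* build with: cc sketch.c -lnuma (Linux only) */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t      os_page_size = (size_t) sysconf(_SC_PAGESIZE);
        size_t      n_pages = 8;
        char       *region = aligned_alloc(os_page_size, n_pages * os_page_size);
        void      **page_ptrs = malloc(sizeof(void *) * n_pages);
        int        *page_status = malloc(sizeof(int) * n_pages);

        /*
         * Loop 1: prepare one pointer per OS page and touch it, so that
         * the kernel maps the page (otherwise the inquiry reports
         * -ENOENT, the "-2" mentioned in the patch comments).
         */
        for (size_t i = 0; i < n_pages; i++)
        {
            page_ptrs[i] = region + i * os_page_size;
            *(volatile char *) page_ptrs[i] = 0;    /* the touch */
        }

        /*
         * One batched inquiry: with nodes == NULL, move_pages() fills
         * page_status[] with the NUMA node of each page instead of
         * moving anything.
         */
        if (move_pages(0, n_pages, page_ptrs, NULL, page_status, 0) == -1)
        {
            perror("move_pages");
            return 1;
        }

        /* Loop 2: only now can the per-page statuses be consumed */
        for (size_t i = 0; i < n_pages; i++)
            printf("page %zu -> node %d\n", i, page_status[i]);

        return 0;
    }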
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
index 52f63aa258c..42a693aa4d4 100644
--- a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -28,7 +28,7 @@ CREATE OR REPLACE VIEW pg_buffercache_numa AS
     SELECT P.* FROM pg_buffercache_numa_pages() AS P
     (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
      relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
-     pinning_backends int4, numa_zone_id int4);
+     pinning_backends int4, zone_id int4);
 
 -- Don't want these to be available to public.
 REVOKE ALL ON FUNCTION pg_buffercache_pages() FROM PUBLIC;
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index b27add81f0a..c5cfa32fa07 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -67,56 +67,59 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
 PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
 PG_FUNCTION_INFO_V1(pg_buffercache_evict);
 
-/*
- * To get reliable results we need to "touch pages" once, see
- * comments nearby pg_buffercache_numa_prepare_ptrs().
- */
-static bool firstUseInBackend = true;
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
 
 /*
- * Helper routine to map Buffers into addresses that can be
- * later consumed by pg_numa_query_pages()
+ * Helper routine to map Buffers into addresses that are later used by
+ * pg_numa_query_pages().
  *
- * Many buffers can point to the same page (in case of
- * BLCKSZ < 4kB), but we want to also query just first
- * address.
+ * When database block size (BLCKSZ) is smaller than the OS page size (4kB),
+ * multiple database buffers will map to the same OS memory page. In this case,
+ * we only need to query the NUMA zone for the first memory address of each
+ * unique OS page rather than for every buffer.
  *
- * In order to get reliable results we also need to touch
- * memory pages, so that inquiry about NUMA zone doesn't
- * return -2.
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA zone doesn't return -2 (which indicates unmapped or
+ * unallocated pages).
  */
 static inline void
-pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk, Size os_page_size,
+pg_buffercache_numa_prepare_ptrs(int buffer_id, float pages_per_blk,
+                                 Size os_page_size,
                                  void **os_page_ptrs)
 {
-    size_t      blk2page = (size_t)(buffer_id * pages_per_blk);
+    size_t      blk2page = (size_t) (buffer_id * pages_per_blk);
 
     for (size_t j = 0; j < pages_per_blk; j++)
    {
-        size_t      blk2pageoff = blk2page + j;
+        size_t      blk2pageoff = blk2page + j;
+
         if (os_page_ptrs[blk2pageoff] == 0)
         {
             volatile uint64 touch pg_attribute_unused();
 
-            /* NBuffers count start really from 1 */
-            os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1) + (os_page_size * j);
+            /* NBuffers starts from 1 */
+            os_page_ptrs[blk2pageoff] = (char *) BufferGetBlock(buffer_id + 1)
+                + (os_page_size * j);
 
-            /* We just need to do it only once in backend */
-            if (firstUseInBackend)
+            /* Only need to touch memory once per backend process lifetime */
+            if (firstNumaTouch)
                 pg_numa_touch_mem_if_required(touch, os_page_ptrs[blk2pageoff]);
-            CHECK_FOR_INTERRUPTS();
         }
+
+        CHECK_FOR_INTERRUPTS();
     }
 }
 
 /*
- * Helper routine for pg_buffercache_(numa_)pages.
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
  *
- * We need fcinfo here and we pass it here with PG_FUNCTION_ARGS
+ * Initialize the function call context and build the tuple descriptor
+ * needed by both callers.
  */
 static BufferCachePagesContext *
-pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
 {
     BufferCachePagesContext *fctx;  /* User function context. */
     MemoryContext oldcontext;
@@ -191,64 +194,68 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, PG_FUNCTION_ARGS)
 }
 
 /*
- * Helper routine for pg_buffercache_(numa_)pages
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Build buffer cache information for a single buffer.
  */
 static void
-pg_buffercache_build_tuple(int i, BufferCachePagesContext *fctx)
+pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
 {
     BufferDesc *bufHdr;
     uint32      buf_state;
 
-    bufHdr = GetBufferDescriptor(i);
+    bufHdr = GetBufferDescriptor(record_id);
     /* Lock each buffer header before inspecting. */
     buf_state = LockBufHdr(bufHdr);
 
-    fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
-    fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
-    fctx->record[i].reltablespace = bufHdr->tag.spcOid;
-    fctx->record[i].reldatabase = bufHdr->tag.dbOid;
-    fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
-    fctx->record[i].blocknum = bufHdr->tag.blockNum;
-    fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
-    fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+    fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
+    fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+    fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
+    fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
+    fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
+    fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
+    fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+    fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
 
     if (buf_state & BM_DIRTY)
-        fctx->record[i].isdirty = true;
+        fctx->record[record_id].isdirty = true;
     else
-        fctx->record[i].isdirty = false;
 
+        fctx->record[record_id].isdirty = false;
+
     /*
      * Note if the buffer is valid, and has storage created
      */
     if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
-        fctx->record[i].isvalid = true;
+        fctx->record[record_id].isvalid = true;
     else
-        fctx->record[i].isvalid = false;
 
-    fctx->record[i].numa_zone_id = -1;
+        fctx->record[record_id].isvalid = false;
+
+    fctx->record[record_id].numa_zone_id = -1;
 
     UnlockBufHdr(bufHdr, buf_state);
 }
 
 /*
- * Helper routine for pg_buffercache_(numa_)pages
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
  */
 static Datum
-get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
 {
     Datum       values[NUM_BUFFERCACHE_PAGES_ELEM];
     bool        nulls[NUM_BUFFERCACHE_PAGES_ELEM];
     HeapTuple   tuple;
 
-    values[0] = Int32GetDatum(fctx->record[i].bufferid);
+    values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
     nulls[0] = false;
 
     /*
     * Set all fields except the bufferid to null if the buffer is unused or
     * not valid.
      */
-    if (fctx->record[i].blocknum == InvalidBlockNumber ||
-        fctx->record[i].isvalid == false)
+    if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
+        fctx->record[record_id].isvalid == false)
     {
         nulls[1] = true;
         nulls[2] = true;
@@ -266,27 +273,27 @@ get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
     }
     else
     {
-        values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
+        values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
         nulls[1] = false;
-        values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
+        values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
         nulls[2] = false;
-        values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
+        values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
         nulls[3] = false;
-        values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
+        values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
         nulls[4] = false;
-        values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
+        values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
         nulls[5] = false;
-        values[6] = BoolGetDatum(fctx->record[i].isdirty);
+        values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
         nulls[6] = false;
-        values[7] = Int16GetDatum(fctx->record[i].usagecount);
+        values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
         nulls[7] = false;
 
         /*
          * unused for v1.0 callers, but the array is always long enough
         */
-        values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
+        values[8] = Int32GetDatum(fctx->record[record_id].pinning_backends);
         nulls[8] = false;
-        values[9] = Int32GetDatum(fctx->record[i].numa_zone_id);
+        values[9] = Int32GetDatum(fctx->record[record_id].numa_zone_id);
         nulls[9] = false;
     }
 
@@ -295,10 +302,6 @@ get_buffercache_tuple(int i, BufferCachePagesContext *fctx)
     return HeapTupleGetDatum(tuple);
 }
 
-/*
- * When updating this routine please sync it with below one:
- * pg_buffercache_numa_pages()
- */
 Datum
 pg_buffercache_pages(PG_FUNCTION_ARGS)
 {
@@ -359,39 +362,46 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
         Size        os_page_size = 0;
         void      **os_page_ptrs = NULL;
         int        *os_pages_status = NULL;
-        int         os_page_count = 0;
+        uint64      os_page_count = 0;
         float       pages_per_blk = 0;
 
         funcctx = SRF_FIRSTCALL_INIT();
+
         if (pg_numa_init() == -1)
-            elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+            elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
         fctx = pg_buffercache_init_entries(funcctx, fcinfo);
 
         /*
-         * This is for gathering some NUMA statistics. We might be using
-         * various DB block sizes (4kB, 8kB , .. 32kB) that end up being
-         * allocated in various different OS memory pages sizes, so first
-         * we need to understand the OS memory page size before calling
-         * move_pages()
-         */
+         * Different database block sizes (4kB, 8kB, ..., 32kB) can be used,
+         * while the OS may have different memory page sizes.
+         *
+         * To correctly map between them, we need to: - Determine the OS
+         * memory page size - Calculate how many OS pages are used by all
+         * buffer blocks - Calculate how many OS pages are contained within
+         * each database block
+         *
+         * This information is needed before calling move_pages() for NUMA
+         * zone inquiry.
+         */
         os_page_size = pg_numa_get_pagesize();
         os_page_count = ((uint64) NBuffers * BLCKSZ) / os_page_size;
         pages_per_blk = (float) BLCKSZ / os_page_size;
 
-        elog(DEBUG1, "NUMA: os_page_count=%d os_page_size=%zu pages_per_blk=%f",
-             os_page_count, os_page_size, pages_per_blk);
+        elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu pages_per_blk=%.2f",
+             (unsigned long) os_page_count, os_page_size, pages_per_blk);
 
         os_page_ptrs = palloc(sizeof(void *) * os_page_count);
-        os_pages_status = palloc(sizeof(int) * os_page_count);
+        os_pages_status = palloc(sizeof(uint64) * os_page_count);
 
         memset(os_page_ptrs, 0, sizeof(void *) * os_page_count);
 
         /*
-         * If we ever get 0xff back from kernel inquiry, then we probably
-         * have bug in our buffers to OS page mapping code here
-         */
+         * If we ever get 0xff back from kernel inquiry, then we probably have
+         * a bug in our buffers to OS page mapping code here.
+         */
         memset(os_pages_status, 0xff, sizeof(int) * os_page_count);
 
-        if (firstUseInBackend)
+        if (firstNumaTouch)
             elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
 
         /*
@@ -405,7 +415,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
         for (i = 0; i < NBuffers; i++)
         {
             pg_buffercache_build_tuple(i, fctx);
-            pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs);
+            pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
+                                             os_page_ptrs);
         }
 
         if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
@@ -416,10 +427,15 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
             int         blk2page = (int) i * pages_per_blk;
 
             /*
-             * Technically we can get errors too here and pass that to
-             * user. Also we could somehow report single DB block spanning
-             * more than one NUMA zone, but it should be rare.
-             */
+             * Set the NUMA zone ID for this buffer based on the first OS page
+             * it maps to.
+             *
+             * Note: We could check for errors in os_pages_status and report
+             * them. Also, a single DB block might span multiple NUMA zones if
+             * it crosses OS pages on zone boundaries, but we only record the
+             * zone of the first page. This is a simplification but should be
+             * sufficient for most analyses.
+             */
             fctx->record[i].numa_zone_id = os_pages_status[blk2page];
         }
     }
@@ -439,7 +455,7 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
     }
     else
     {
-        firstUseInBackend = false;
+        firstNumaTouch = false;
         SRF_RETURN_DONE(funcctx);
     }
 }
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 75978a6eaed..4b49bb2974a 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -45,8 +45,9 @@
  </para>
 
  <para>
-  The similiar <function>pg_buffercache_numa_pages()</function> is a slower
-  variant of the above, but also can provide NUMA node ID for shared buffer entry.
+  The <function>pg_buffercache_numa_pages()</function> function provides the same
+  information as <function>pg_buffercache_pages()</function> but is slower because
+  it also provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
   The <structname>pg_buffercache_numa</structname> view wraps the function for
   convenient use.
  </para>
@@ -213,13 +214,14 @@
   <title>The <structname>pg_buffercache_numa</structname> View</title>
 
   <para>
-   The definitions of the columns exposed are almost identical to the previous
-   <structname>pg_buffercache</structname> view, but this one includes one additional
-   column numa_zone_id as defined in <xref linkend="pgbuffercache-numa-columns"/>.
+   The definitions of the columns exposed are identical to the
+   <structname>pg_buffercache</structname> view, except that this one includes
+   one additional <structfield>zone_id</structfield> column as defined in
+   <xref linkend="pgbuffercache-numa-columns"/>.
  </para>
 
  <table id="pgbuffercache-numa-columns">
-  <title><structname>pg_buffercache_numa</structname> Columns</title>
+  <title><structname>pg_buffercache_numa</structname> Extra Column</title>
   <tgroup cols="1">
    <thead>
     <row>
@@ -235,11 +237,12 @@
    <tbody>
     <row>
      <entry role="catalog_table_entry"><para role="column_definition">
-      <structfield>numa_zone_id</structfield> <type>integer</type>
+      <structfield>zone_id</structfield> <type>integer</type>
      </para>
      <para>
-      ID (number) of the NUMA node for this particular buffer. NULL if the shared buffer
-      has not been used yet.On systems without NUMA this usually returns 0.
+      <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+      has not been used yet. On systems without <acronym>NUMA</acronym> support
+      this returns 0.
      </para></entry>
     </row>
 
@@ -248,16 +251,10 @@
  </table>
 
  <para>
-   This is clone version of the original pg_buffercache view, however it provides
-   additional <structfield>numa_zone_id</structfield> column. Fetching this
-   information from OS is costly and might take much longer and querying it is not
-   recommended by automated or monitoring systems.
-  </para>
-
-  <para>
-   As NUMA node ID inquiry for each page requires memory pages to be paged-in, first
-   execution of this function can take long time especially on systems with bigint
-   shared_buffers and without huge_pages enabled.
+   As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+   to be paged-in, the first execution of this function can take a noticeable
+   amount of time. In all cases (first execution or not), retrieving this
+   information is costly and querying the view at a high frequency is not
+   recommended.
  </para>
 </sect2>