On Mon, Mar 20, 2023 at 9:34 PM John Naylor
<john.nay...@enterprisedb.com> wrote:
>
>
> On Mon, Mar 20, 2023 at 12:25 PM Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
> >
> > On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.m...@gmail.com> 
> > wrote:
> > >
> > > On Fri, Mar 17, 2023 at 4:03 PM John Naylor
> > > <john.nay...@enterprisedb.com> wrote:
> > > >
> > > > On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.m...@gmail.com> 
> > > > wrote:
> > > > >
> > > > > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > > > > <john.nay...@enterprisedb.com> wrote:
> > > > > >
> > > > > > I wrote:
> > > > > >
> > > > > > > > > Since the block-level measurement is likely overestimating 
> > > > > > > > > quite a bit, I propose to simply reverse the order of the 
> > > > > > > > > actions here, effectively reporting progress for the *last 
> > > > > > > > > page* and not the current one: First update progress with the 
> > > > > > > > > current memory usage, then add tids for this page. If this 
> > > > > > > > > allocated a new block, only a small bit of that will be 
> > > > > > > > > written to. If this block pushes it over the limit, we will 
> > > > > > > > > detect that up at the top of the loop. It's kind of like our 
> > > > > > > > > earlier attempts at a "fudge factor", but simpler and less 
> > > > > > > > > brittle. And, as far as OS pages we have actually written to, 
> > > > > > > > > I think it'll effectively respect the memory limit, at least 
> > > > > > > > > in the local mem case. And the numbers will make sense.
> > > >
> > > > > > I still like my idea at the top of the page -- at least for vacuum 
> > > > > > and m_w_m. It's still not completely clear if it's right but I've 
> > > > > > got nothing better. It also ignores the work_mem issue, but I've 
> > > > > > given up anticipating all future cases at the moment.
> > > >
> > > > > IIUC you suggested measuring memory usage by tracking how much
> > > > > memory is allocated in chunks within a block. If your idea at the
> > > > > top of the page follows this method, it still doesn't deal with
> > > > > the point Andres mentioned.
> > > >
> > > > Right, but that idea was orthogonal to how we measure memory use, and 
> > > > in fact mentions blocks specifically. The re-ordering was just to make 
> > > > sure that progress reporting didn't show current-use > max-use.
> > >
> > > Right. I still like your re-ordering idea. It's true that most of the
> > > last allocated block is not actually used yet at the point where heap
> > > scanning stops. I'm guessing we can just check whether the context
> > > memory has gone over the limit. But I'm concerned it might not work
> > > well on systems where memory overcommit is disabled.
> > >
> > > >
> > > > However, the big question remains DSA, since a new segment can be as 
> > > > large as the entire previous set of allocations. It seems it just 
> > > > wasn't designed for things where memory growth is unpredictable.
> >
> > aset.c also has a similar characteristic: it allocates an 8kB block
> > upon the first allocation in a context, and doubles that size for each
> > successive block request, but we can specify the initial block size
> > and the max block size. This made me think of another idea: specify
> > both to DSA as well, with both values calculated based on m_w_m. For
> > example, we
>
> That's an interesting idea, and the analogous behavior to aset could be a 
> good thing for readability and maintainability. Worth seeing if it's workable.
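
For reference, the aset knob mentioned above: callers of
AllocSetContextCreate() already choose the initial and maximum block
sizes explicitly when creating a context. A minimal illustration (the
context name and sizes here are just examples):

    /* aset.c analogue: the caller picks init/max block sizes up front */
    MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
                                              "example context",
                                              0,            /* minContextSize */
                                              1024,         /* initBlockSize */
                                              1024 * 1024); /* maxBlockSize */

The patch gives dsa_create() the same kind of knob.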

I've attached a quick hack patch. It can be applied on top of the v32
patches. The changes to dsa.c are straightforward: they make the
initial and maximum block sizes configurable. The patch also includes a
test function, test_memory_usage(), to simulate how DSA segments grow
behind the shared radix tree. If we set the first argument to true, it
calculates both the initial and maximum block sizes based on work_mem
(I used work_mem here just because its value range is larger than
m_w_m's):

postgres(1:833654)=# select test_memory_usage(true);
NOTICE:  memory limit 134217728
NOTICE:  init 1048576 max 16777216
NOTICE:  initial: 1048576
NOTICE:  rt_create: 1048576
NOTICE:  allocate new DSM [1] 1048576
NOTICE:  allocate new DSM [2] 2097152
NOTICE:  allocate new DSM [3] 2097152
NOTICE:  allocate new DSM [4] 4194304
NOTICE:  allocate new DSM [5] 4194304
NOTICE:  allocate new DSM [6] 8388608
NOTICE:  allocate new DSM [7] 8388608
NOTICE:  allocate new DSM [8] 16777216
NOTICE:  allocate new DSM [9] 16777216
NOTICE:  allocate new DSM [10] 16777216
NOTICE:  allocate new DSM [11] 16777216
NOTICE:  allocate new DSM [12] 16777216
NOTICE:  allocate new DSM [13] 16777216
NOTICE:  allocate new DSM [14] 16777216
NOTICE:  reached: 148897792 (+14680064)
NOTICE:  12718205 keys inserted: 148897792
 test_memory_usage
-------------------

(1 row)

Time: 7195.664 ms (00:07.196)
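
The doubling pattern in the NOTICE lines follows the sizing rule in
make_new_segment(): each new segment's size is the initial size shifted
left once per DSA_NUM_SEGMENTS_AT_EACH_SIZE (2) segments, capped at the
maximum. A standalone sketch that reproduces the sequence above (the
names just mirror dsa.c; this isn't backend code):

    #include <stdio.h>

    #define DSA_NUM_SEGMENTS_AT_EACH_SIZE 2
    #define Min(x, y) ((x) < (y) ? (x) : (y))

    int
    main(void)
    {
        size_t init = 1024 * 1024;      /* init 1048576 */
        size_t max = 16 * 1024 * 1024;  /* max 16777216 */
        size_t total = init;            /* segment [0] is the control segment */

        for (size_t new_index = 1; new_index <= 14; new_index++)
        {
            size_t size = init << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE);

            size = Min(size, max);
            total += size;
            printf("allocate new DSM [%zu] %zu (total %zu)\n",
                   new_index, size, total);
        }
        /* total ends at 148897792, matching the "reached" NOTICE above */
        return 0;
    }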

Setting the first argument to false, we can specify both values
manually in the second and third arguments:

postgres(1:833654)=# select test_memory_usage(false, 1024 * 1024, 1024 * 1024 * 1024 * 10::bigint);
NOTICE:  memory limit 134217728
NOTICE:  init 1048576 max 10737418240
NOTICE:  initial: 1048576
NOTICE:  rt_create: 1048576
NOTICE:  allocate new DSM [1] 1048576
NOTICE:  allocate new DSM [2] 2097152
NOTICE:  allocate new DSM [3] 2097152
NOTICE:  allocate new DSM [4] 4194304
NOTICE:  allocate new DSM [5] 4194304
NOTICE:  allocate new DSM [6] 8388608
NOTICE:  allocate new DSM [7] 8388608
NOTICE:  allocate new DSM [8] 16777216
NOTICE:  allocate new DSM [9] 16777216
NOTICE:  allocate new DSM [10] 33554432
NOTICE:  allocate new DSM [11] 33554432
NOTICE:  allocate new DSM [12] 67108864
NOTICE:  reached: 199229440 (+65011712)
NOTICE:  12718205 keys inserted: 199229440
 test_memory_usage
-------------------

(1 row)

Time: 7187.571 ms (00:07.188)

It seems to work fine. The difference between the above two cases is
the maximum block size (16MB vs. 10GB). The first case allocated two
more DSA segments, but there was no big difference in performance in
my test environment. The smaller maximum block size also keeps the
overshoot past the memory limit much smaller (+14MB vs. +62MB).
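
If we go this route, the caller side would look something like the
following when creating the shared tree for vacuum (a hypothetical
sketch reusing the patch's heuristic, but with maintenance_work_mem;
the variable names and tranche_id are illustrative):

    /* hypothetical caller-side sizing, mirroring test_memory_usage() */
    int64       limit = (int64) maintenance_work_mem * 1024;
    size_t      init_size = Min(limit / 4, 1024 * 1024);
    size_t      max_size = Max(limit / 8, (int64) 8 * 1024 * 1024);
    dsa_area   *dsa = dsa_create_ext(tranche_id, init_size, max_size);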

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index ad66265e23..12121dd1d4 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -86,3 +86,12 @@ OUT iter_ms int8
 returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function test_memory_usage(
+use_m_w_m bool,
+init_blksize int8 default (1024 * 1024),
+max_blksize int8 default (1024 * 1024 * 1024 * 10::bigint)
+)
+returns void
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 41d83aee11..0580faed6c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -40,6 +40,18 @@ PG_MODULE_MAGIC;
 // #define RT_SHMEM
 #include "lib/radixtree.h"
 
+//#define RT_DEBUG
+#define RT_PREFIX shared_rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+//#define RT_USE_DELETE
+//#define RT_MEASURE_MEMORY_USAGE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+#define RT_SHMEM
+#include "lib/radixtree.h"
+
 /*
  * Return the number of keys in the radix tree.
  */
@@ -57,6 +69,7 @@ PG_FUNCTION_INFO_V1(bench_fixed_height_search);
 PG_FUNCTION_INFO_V1(bench_search_random_nodes);
 PG_FUNCTION_INFO_V1(bench_node128_load);
 PG_FUNCTION_INFO_V1(bench_tidstore_load);
+PG_FUNCTION_INFO_V1(test_memory_usage);
 
 static uint64
 tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -745,4 +758,56 @@ stub_iter()
        iter = rt_begin_iterate(rt);
        rt_iterate_next(iter, &key, &value);
        rt_end_iterate(iter);
-}
\ No newline at end of file
+}
+
+Datum
+test_memory_usage(PG_FUNCTION_ARGS)
+{
+       bool    use_work_mem = PG_GETARG_BOOL(0);
+       int64   init = PG_GETARG_INT64(1);
+       int64   max = PG_GETARG_INT64(2);
+       int tranche_id = LWLockNewTrancheId();
+       const int limit = work_mem * 1024;
+       dsa_area *dsa;
+       shared_rt_radix_tree *rt;
+       uint64 i;
+
+       LWLockRegisterTranche(tranche_id, "test");
+
+       if (use_work_mem)
+       {
+               init = Min(((int64)work_mem * 1024) / 4, 1024 * 1024);
+               max = Max(((int64)work_mem * 1024) / 8, (int64) 8 * 1024 * 1024);
+       }
+
+       elog(NOTICE, "memory limit %ld", (int64) work_mem * 1024);
+       elog(NOTICE, "init %ld max %ld", init, max);
+       dsa = dsa_create_ext(tranche_id, init, max);
+
+       elog(NOTICE, "initial: %zu", dsa_get_total_segment_size(dsa));
+
+       rt = shared_rt_create(CurrentMemoryContext, dsa, tranche_id);
+       elog(NOTICE, "rt_create: %zu", dsa_get_total_segment_size(dsa));
+
+       for (i = 0; i < (1000 * 1000 * 1000); i++)
+       {
+               volatile bool ret;
+               size_t size;
+
+               ret = shared_rt_set(rt, i, &i);
+
+               size = dsa_get_total_segment_size(dsa);
+
+               if (limit < size)
+               {
+                       elog(NOTICE, "reached: %zu (+%zu)", size, size - limit);
+                       break;
+               }
+       }
+
+       elog(NOTICE, "%ld keys inserted: %zu", i, dsa_get_total_segment_size(dsa));
+
+       shared_rt_free(rt);
+       dsa_detach(dsa);
+       PG_RETURN_VOID();
+}
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..a81008d84e 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -60,14 +60,6 @@
 #include "utils/freepage.h"
 #include "utils/memutils.h"
 
-/*
- * The size of the initial DSM segment that backs a dsa_area created by
- * dsa_create.  After creating some number of segments of this size we'll
- * double this size, and so on.  Larger segments may be created if necessary
- * to satisfy large requests.
- */
-#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
-
 /*
  * How many segments to create before we double the segment size.  If this is
  * low, then there is likely to be a lot of wasted space in the largest
@@ -77,17 +69,6 @@
  */
 #define DSA_NUM_SEGMENTS_AT_EACH_SIZE 2
 
-/*
- * The number of bits used to represent the offset part of a dsa_pointer.
- * This controls the maximum size of a segment, the maximum possible
- * allocation size and also the maximum number of segments per area.
- */
-#if SIZEOF_DSA_POINTER == 4
-#define DSA_OFFSET_WIDTH 27            /* 32 segments of size up to 128MB */
-#else
-#define DSA_OFFSET_WIDTH 40            /* 1024 segments of size up to 1TB */
-#endif
-
 /*
  * The maximum number of DSM segments that an area can own, determined by
  * the number of bits remaining (but capped at 1024).
@@ -98,9 +79,6 @@
 /* The bitmask for extracting the offset from a dsa_pointer. */
 #define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
 
-/* The maximum size of a DSM segment. */
-#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
-
 /* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
 #define DSA_PAGES_PER_SUPERBLOCK               16
 
@@ -319,6 +297,10 @@ typedef struct
        dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
        /* The object pools for each size class. */
        dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+       /* initial allocation segment size */
+       size_t          init_segment_size;
+       /* maximum allocation segment size */
+       size_t          max_segment_size;
        /* The total size of all active segments. */
        size_t          total_segment_size;
        /* The maximum total size of backing storage we are allowed. */
@@ -413,7 +395,9 @@ static dsa_segment_map *make_new_segment(dsa_area *area, size_t requested_pages)
 static dsa_area *create_internal(void *place, size_t size,
                                                                 int tranche_id,
                                                                 dsm_handle control_handle,
-                                                                dsm_segment *control_segment);
+                                                                dsm_segment *control_segment,
+                                                                size_t init_segment_size,
+                                                                size_t max_segment_size);
 static dsa_area *attach_internal(void *place, dsm_segment *segment,
                                                                 dsa_handle handle);
 static void check_for_freed_segments(dsa_area *area);
@@ -429,7 +413,7 @@ static void check_for_freed_segments_locked(dsa_area *area);
  * we require the caller to provide one.
  */
 dsa_area *
-dsa_create(int tranche_id)
+dsa_create_ext(int tranche_id, size_t init_segment_size, size_t max_segment_size)
 {
        dsm_segment *segment;
        dsa_area   *area;
@@ -438,7 +422,7 @@ dsa_create(int tranche_id)
         * Create the DSM segment that will hold the shared control object and the
         * first segment of usable space.
         */
-       segment = dsm_create(DSA_INITIAL_SEGMENT_SIZE, 0);
+       segment = dsm_create(init_segment_size, 0);
 
        /*
         * All segments backing this area are pinned, so that DSA can explicitly
@@ -450,9 +434,10 @@ dsa_create(int tranche_id)
 
        /* Create a new DSA area with the control object in this segment. */
        area = create_internal(dsm_segment_address(segment),
-                                                  DSA_INITIAL_SEGMENT_SIZE,
+                                                  init_segment_size,
                                                   tranche_id,
-                                                  dsm_segment_handle(segment), segment);
+                                                  dsm_segment_handle(segment), segment,
+                                                  init_segment_size, max_segment_size);
 
        /* Clean up when the control segment detaches. */
        on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
@@ -478,13 +463,15 @@ dsa_create(int tranche_id)
  * See dsa_create() for a note about the tranche arguments.
  */
 dsa_area *
-dsa_create_in_place(void *place, size_t size,
-                                       int tranche_id, dsm_segment *segment)
+dsa_create_in_place_ext(void *place, size_t size,
+                                               int tranche_id, dsm_segment *segment,
+                                               size_t init_segment_size, size_t max_segment_size)
 {
        dsa_area   *area;
 
        area = create_internal(place, size, tranche_id,
-                                                  DSM_HANDLE_INVALID, NULL);
+                                                  DSM_HANDLE_INVALID, NULL,
+                                                  init_segment_size, max_segment_size);
 
        /*
         * Clean up when the control segment detaches, if a containing DSM segment
@@ -1024,6 +1011,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
        LWLockRelease(DSA_AREA_LOCK(area));
 }
 
+size_t
+dsa_get_total_segment_size(dsa_area *area)
+{
+       size_t size;
+
+       LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+       size = area->control->total_segment_size;
+       LWLockRelease(DSA_AREA_LOCK(area));
+
+       return size;
+}
+
 /*
  * Aggressively free all spare memory in the hope of returning DSM segments to
  * the operating system.
@@ -1203,7 +1202,8 @@ static dsa_area *
 create_internal(void *place, size_t size,
                                int tranche_id,
                                dsm_handle control_handle,
-                               dsm_segment *control_segment)
+                               dsm_segment *control_segment,
+                               size_t init_segment_size, size_t max_segment_size)
 {
        dsa_area_control *control;
        dsa_area   *area;
@@ -1213,6 +1213,9 @@ create_internal(void *place, size_t size,
        size_t          metadata_bytes;
        int                     i;
 
+       Assert(max_segment_size >= init_segment_size);
+       Assert(max_segment_size <= DSA_MAX_SEGMENT_SIZE);
+
        /* Sanity check on the space we have to work in. */
        if (size < dsa_minimum_size())
                elog(ERROR, "dsa_area space must be at least %zu, but %zu provided",
@@ -1242,8 +1245,10 @@ create_internal(void *place, size_t size,
        control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
        control->segment_header.usable_pages = usable_pages;
        control->segment_header.freed = false;
-       control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+       control->segment_header.size = size;
        control->handle = control_handle;
+       control->init_segment_size = init_segment_size;
+       control->max_segment_size = max_segment_size;
        control->max_total_segment_size = (size_t) -1;
        control->total_segment_size = size;
        control->segment_handles[0] = control_handle;
@@ -2112,12 +2117,13 @@ make_new_segment(dsa_area *area, size_t requested_pages)
         * move to huge pages in the future.  Then we work back to the number of
         * pages we can fit.
         */
-       total_size = DSA_INITIAL_SEGMENT_SIZE *
+       total_size = area->control->init_segment_size *
                ((size_t) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
-       total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+       total_size = Min(total_size, area->control->max_segment_size);
        total_size = Min(total_size,
                                         area->control->max_total_segment_size -
                                         area->control->total_segment_size);
+       elog(NOTICE, "allocate new DSM [%zu] %zu", new_index, total_size);
 
        total_pages = total_size / FPM_PAGE_SIZE;
        metadata_bytes =
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..0baa32b9de 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -77,6 +77,28 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
 /* A sentinel value for dsa_pointer used to indicate failure to allocate. */
 #define InvalidDsaPointer ((dsa_pointer) 0)
 
+/*
+ * The size of the initial DSM segment that backs a dsa_area created by
+ * dsa_create.  After creating some number of segments of this size we'll
+ * double this size, and so on.  Larger segments may be created if necessary
+ * to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27            /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40            /* 1024 segments of size up to 1TB */
+#endif
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
 /* Check if a dsa_pointer value is valid. */
 #define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
 
@@ -88,6 +110,14 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
 #define dsa_allocate0(area, size) \
        dsa_allocate_extended(area, size, DSA_ALLOC_ZERO)
 
+/* Create dsa_area with default segment sizes */
+#define dsa_create(tranche_id) \
+       dsa_create_ext(tranche_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
+
+/* Create dsa_area with default segment sizes in an existing shared memory space */
+#define dsa_create_in_place(place, size, tranche_id, segment) \
+       dsa_create_in_place_ext(place, size, tranche_id, segment, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
+
 /*
  * The type used for dsa_area handles.  dsa_handle values can be shared with
  * other processes, so that they can attach to them.  This provides a way to
@@ -102,10 +132,12 @@ typedef dsm_handle dsa_handle;
 /* Sentinel value to use for invalid dsa_handles. */
 #define DSA_HANDLE_INVALID ((dsa_handle) DSM_HANDLE_INVALID)
 
-
-extern dsa_area *dsa_create(int tranche_id);
-extern dsa_area *dsa_create_in_place(void *place, size_t size,
-                                                                        int tranche_id, dsm_segment *segment);
+extern dsa_area *dsa_create_ext(int tranche_id, size_t init_segment_size,
+                                                               size_t max_segment_size);
+extern dsa_area *dsa_create_in_place_ext(void *place, size_t size,
+                                                                                int tranche_id, dsm_segment *segment,
+                                                                                size_t init_segment_size,
+                                                                                size_t max_segment_size);
 extern dsa_area *dsa_attach(dsa_handle handle);
 extern dsa_area *dsa_attach_in_place(void *place, dsm_segment *segment);
 extern void dsa_release_in_place(void *place);
@@ -117,6 +149,7 @@ extern void dsa_pin(dsa_area *area);
 extern void dsa_unpin(dsa_area *area);
 extern void dsa_set_size_limit(dsa_area *area, size_t limit);
 extern size_t dsa_minimum_size(void);
+extern size_t dsa_get_total_segment_size(dsa_area *area);
 extern dsa_handle dsa_get_handle(dsa_area *area);
 extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
 extern void dsa_free(dsa_area *area, dsa_pointer dp);
