On Fri, Sep 12, 2014 at 11:55 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Thu, Sep 11, 2014 at 4:31 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> > On 2014-09-10 12:17:34 +0530, Amit Kapila wrote:
> > > +++ b/src/backend/postmaster/bgreclaimer.c
> >
> > A fair number of comments in that file refer to bgwriter...
>
> Will fix.
Fixed.

> > > @@ -0,0 +1,302 @@
> > > +/*-------------------------------------------------------------------------
> > > + *
> > > + * bgreclaimer.c
> > > + *
> > > + * The background reclaimer (bgreclaimer) is new as of Postgres 9.5.  It
> > > + * attempts to keep regular backends from having to run clock sweep (which
> > > + * they would only do when they don't find a usable shared buffer from
> > > + * freelist to read in another page).
> >
> > That's not really accurate. Freelist pages are often also needed to
> > write new pages, without reading anything in.
>
> Agreed, but the same is used in bgwriter file as well; so if we
> change here, we might want to change bgwriter file header as well.

I have changed only bgreclaimer for this comment; if the same is required
for bgwriter, I can create a separate patch, as that change is not related
to this patch and I thought it is better to keep it separate.

> > I'd phrase it as "which
> > they only need to do if they don't find a victim buffer from the
> > freelist"
>
> victim buffer sounds more like a buffer which it will get from
> clock sweep, how about next candidate (same is used in function
> header of StrategyGetBuffer()).

Fixed.

> > > In the best scenario all requests
> > > + * for shared buffers will be fulfilled from freelist as the background
> > > + * reclaimer process always tries to maintain buffers on freelist.  However,
> > > + * regular backends are still empowered to run clock sweep to find a usable
> > > + * buffer if the bgreclaimer fails to maintain enough buffers on freelist.
> >
> > "empowered" sounds strange to me. 'still can run the clock sweep'?
>
> No harm in changing like what you are suggesting, however the same is
> used in file header of bgwriter.c as well, so I think lets keep this usage as
> it is because there is no correctness issue here.

I have not changed anything for this comment.

> >
> > > +static void bgreclaim_quickdie(SIGNAL_ARGS);
> > > +static void BgreclaimSigHupHandler(SIGNAL_ARGS);
> > > +static void ReqShutdownHandler(SIGNAL_ARGS);
> > > +static void bgreclaim_sigusr1_handler(SIGNAL_ARGS);
> >
> > This looks inconsistent.
>
> I have kept based on bgwriter, so not sure if it's good to change.
> However I we want consistent in naming, I would like to keep
> something like:
>
> ReclaimShutdownHandler
> ReclaimQuickDieHandler
> ..
> ..

Changed the function names to make them consistent.

> > > + /*
> > > + * If an exception is encountered, processing resumes here.
> > > + *
> > > + * See notes in postgres.c about the design of this coding.
> > > + */
> > > + if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> > > + {
> > > + /* Since not using PG_TRY, must reset error stack by hand */
> > > + error_context_stack = NULL;
> ..
> >
> > No LWLockReleaseAll(), AbortBufferIO(), ...? Unconvinced that that's a
> > good idea, regardless of it possibly being true today (which I'm not
> > sure about yet).
>
> I will add LWLockReleaseAll() in exception handling as discussed
> elsewhere in thread.

Done.

> > > +
> > > + /* Now we can allow interrupts again */
> > > + RESUME_INTERRUPTS();
> >
> > Other processes sleep for a second here, I think that's a good
> > idea. E.g. that bit:
>
> Agreed, will make change as per suggestion.

Done.
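[Editor's note] To illustrate the error-recovery discussion just above, here is a
minimal sketch of how the sigsetjmp() block in bgreclaimer might look once
LWLockReleaseAll() and the one-second pause are added. It is modeled on the
recovery blocks of other auxiliary processes (e.g. bgwriter.c), not taken from
the patch; local_sigjmp_buf and the surrounding setup are assumed to exist as
in those processes.

    /* Sketch only: error recovery modeled on bgwriter.c, not the actual patch code */
    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
    {
        /* Since not using PG_TRY, must reset error stack by hand */
        error_context_stack = NULL;

        /* Prevent interrupts while cleaning up */
        HOLD_INTERRUPTS();

        /* Report the error to the server log */
        EmitErrorReport();

        /*
         * A minimal subset of AbortTransaction(): release LWLocks and
         * buffer-related resources, then discard the error state.
         */
        LWLockReleaseAll();
        AbortBufferIO();
        UnlockBuffers();
        FlushErrorState();

        /* Now we can allow interrupts again */
        RESUME_INTERRUPTS();

        /*
         * Sleep at least 1 second after any error, so a repeating failure
         * cannot fill the log as fast as we can write it.
         */
        pg_usleep(1000000L);
    }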
>
> > > +
> > > + /*
> > > + * Loop forever
> > > + */
> > > + for (;;)
> > > + {
> > > + int rc;
> > > +
> > > +
> > > + /*
> > > + * Backend will signal bgreclaimer when the number of buffers in
> > > + * freelist falls below than low water mark of freelist.
> > > + */
> > > + rc = WaitLatch(&MyProc->procLatch,
> > > +                WL_LATCH_SET | WL_POSTMASTER_DEATH,
> > > +                -1);
> >
> > That's probably not going to work well directly after a (re)start of
> > bgreclaim (depending on how you handle the water mark, I'll see in a
> > bit).
>
> Could you please be more specific here?

I wasn't sure if any change is required here, so I have kept the code as it is.

> > >
> > > +Background Reclaimer's Processing
> > > +---------------------------------
> ..
> > > +Two water mark indicators are used to maintain sufficient number of buffers
> > > +on freelist. Low water mark indicator is used by backends to wake bgreclaimer
> > > +when the number of buffers in freelist falls below it. High water mark
> > > +indicator is used by bgreclaimer to move buffers to freelist.
> >
> > For me the description of the high water as stated here doesn't seem to
> > explain anything.
> >
> > This section should have a description of how the reclaimer interacts
> > with the bgwriter logic. Do we put dirty buffers on the freelist that
> > are then cleaned by the bgwriter? Which buffers does the bgwriter write
> > out?
>
> As discussed in thread, I will change this accordingly.

Done.

> > Move buffers with reference and a usage_count *of* zero to freelist. By
> > maintaining enough buffers in the freelist (up to the list's high water
> > mark), we drastically reduce the likelihood of individual backends
> > having to perform the clock sweep themselves.
>
> Okay will rephrase it as per your suggestion.

Done.

> > > + * This is called by the background reclaim process when the number
> > > + * of buffers in freelist falls below low water mark of freelist.
> > > + */
> >
> > The logic used here *definitely* needs to be documented in another form
> > somewhere in the source.
>
> I think the way Robert has suggested to modify Readme adresses
> this to an extent, however if you think it is better to go in more
> detail, then I can expand function header of function
> BgMoveBuffersToFreelist() or on top of bgreclaimer.c, what do you prefer?

> > > +
> > > + if (tmp_num_to_free == 0)
> > > + break;
> >
> > num_to_free isn't a convincing name if I understand what this is doing
> > correctly. Maybe 'move_to_freelist', 'freelist_needed',
> > 'needed_on_freelist' or something like that?
>
> I think keeping num or count in name could be more helpful as it is used
> to loop for finding usable buffers. How about 'num_needed_on_freelist',
> without any count or num it sounds more like a boolean variable.

Changed.

> > I wonder if we don't want to increase the high watermark when
> > tmp_recent_backend_clocksweep > 0?
> >
> > > + while (tmp_num_to_free > 0)
> > > + {
> ..
> > > + /*
> > > + * If the buffer is pinned or has a nonzero usage_count, we cannot
> > > + * move it to freelist; decrement the usage_count (unless pinned)
> > > + * and keep scanning.
> > > + */
> > > + LockBufHdr(bufHdr);
> >
> > Hm. Perhaps we should do a bufHdr->refcount != zero check without
> > locking here? The atomic op will transfer the cacheline exclusively to
> > the reclaimer's CPU. Even though it very shortly afterwards will be
> > touched afterwards by the pinning backend.
>
> As per discussion, the conclusion seems to be that we can do some
> more tests to see if we need such a change, I will do more tests on
> the lines suggested by Robert in below mails and then we can decide
> if any thing is required.

Performance data related to this still needs to be collected.
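[Editor's note] For readers following along, here is a rough sketch of the
reclaim loop being discussed: buffers that are neither pinned nor recently
used are moved to the freelist tail, while the rest have their usage_count
decremented, exactly as the clock sweep would do. The names
(num_needed_on_freelist, StrategyMoveBufferToFreelistTail, next_victim) follow
the quoted hunks or are assumed for illustration; this is not the patch code
itself.

    /* Illustrative sketch only, not the actual patch code */
    while (num_needed_on_freelist > 0)
    {
        volatile BufferDesc *bufHdr = &BufferDescriptors[next_victim];
        bool        candidate = false;

        /* advance the reclaimer's clock hand, wrapping around at NBuffers */
        next_victim = (next_victim + 1) % NBuffers;

        /*
         * Only a buffer that is neither pinned nor recently used can go to
         * the freelist; otherwise decrement usage_count (unless pinned) and
         * keep scanning.
         */
        LockBufHdr(bufHdr);
        if (bufHdr->refcount == 0)
        {
            if (bufHdr->usage_count > 0)
                bufHdr->usage_count--;
            else
                candidate = true;
        }
        UnlockBufHdr(bufHdr);

        /* Move the candidate to the freelist tail, outside the header lock */
        if (candidate && StrategyMoveBufferToFreelistTail(bufHdr))
            num_needed_on_freelist--;
    }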
> > > +/*
> > > + * Water mark indicators for maintaining buffers on freelist. When the
> > > + * number of buffers on freelist drops below the low water mark, the
> > > + * allocating backend sets the latch and bgreclaimer wakesup and begin
> > > + * adding buffer's to freelist until it reaches high water mark and then
> > > + * again goes back to sleep.
> > > + */
> >
> > s/wakesup/wakes up/; s/begin adding/begins adding/; s/buffer's/buffers/;
> > /to freelist/to the freelist/; s/reaches high water/reaches the high water/
>
> Will change as per suggestions.

Done.

> > > +int freelistLowWaterMark;
> > > +int freelistHighWaterMark;
> > > +
> > > +/* Percentage indicators for maintaining buffers on freelist */
> > > +#define HIGH_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.005
> > > +#define LOW_WATER_MARK_FREELIST_BUFFERS_PERCENT 0.2
> > > +#define MIN_HIGH_WATER_MARK 5
> > > +#define MAX_HIGH_WATER_MARK 2000
> >
> > I'm confused. The high water mark percentage is smaller than the low
> > water mark?
>
> High water mark is percentage of NBuffers (shared_buffers).
> Low water mark is percentage of High water mark.
>
> I will add a comment.

Added a comment.

> > >
> > > - *lock_held = true;
> > > - LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
> > > + SpinLockAcquire(&StrategyControl->freelist_lck);
> >
> > ..
> > > + bgreclaimerLatch = StrategyControl->bgreclaimerLatch;
> > > bgwriterLatch = StrategyControl->bgwriterLatch;
> >
> > I don't understand why these need to be grabbed under the spinlock?
>
> In earlier version of patch, it was done without spinklock,
> however Robert has given comment that as it was previously done
> with BufFreelist lock, these should be under spinlock (atleast
> that is what I understood) and I tested the patch again by having
> them under spinlock and didn't find any difference, so I have moved
> them under spinlock.

I have not changed anything related to this.

> ..
> > > + if (numFreeListBuffers < freelistLowWaterMark && bgreclaimerLatch)
> > > + SetLatch(StrategyControl->bgreclaimerLatch);
> > > +
> ..
> > > }
> > > - UnlockBufHdr(buf);
> > > }
> >
> > I think it makes sense to break out this bit into its own
> > function. That'll make StrategyGetBuffer() a good bit easier to read.
>
> I will try to move it into a new function GetBufferFromFreelist().

Moved this part of the code into a new function.

> > > +/*
> > > + * StrategyMoveBufferToFreeListEnd: put a buffer on the end of freelist
> > > + */
> > > +bool
> > > +StrategyMoveBufferToFreeListEnd(volatile BufferDesc *buf)
> > > +{
> >
> > Should maybe rather be named *Tail?
>
> Tail is better suited than End, I will change it.

Changed.
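[Editor's note] To tie the freelist-side pieces together (the water marks, the
freelist spinlock, and the GetBufferFromFreelist() helper mentioned above),
here is a minimal sketch of what that helper could look like. Names are taken
from the quoted hunks where possible (StrategyControl, freelist_lck,
freelistLowWaterMark, bgreclaimerLatch); numFreeListBuffers is assumed to be a
counter kept in StrategyControl, and this is only an illustration of the idea,
not the patch code.

    /*
     * Sketch only.  Per the explanation above, the water marks would be
     * derived roughly as:
     *   freelistHighWaterMark = clamp(NBuffers * 0.005, 5, 2000)
     *   freelistLowWaterMark  = freelistHighWaterMark * 0.2
     */
    static volatile BufferDesc *
    GetBufferFromFreelist(void)
    {
        volatile BufferDesc *buf = NULL;

        SpinLockAcquire(&StrategyControl->freelist_lck);

        /* Pop a buffer from the head of the freelist, if one is available */
        if (StrategyControl->firstFreeBuffer >= 0)
        {
            buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
            StrategyControl->firstFreeBuffer = buf->freeNext;
            buf->freeNext = FREENEXT_NOT_IN_LIST;
            StrategyControl->numFreeListBuffers--;
        }

        /* Wake bgreclaimer once the freelist drops below the low water mark */
        if (StrategyControl->numFreeListBuffers < freelistLowWaterMark &&
            StrategyControl->bgreclaimerLatch)
            SetLatch(StrategyControl->bgreclaimerLatch);

        SpinLockRelease(&StrategyControl->freelist_lck);

        return buf;
    }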
> > > +void
> > > +StrategyGetFreelistAccessInfo(uint32 *num_buf_to_free, uint32 *num_buf_alloc,
> > > +                              uint32 *num_buf_backend_clocksweep)
> > > +
> > > + if (num_buf_alloc)
> > > + {
> > > + *num_buf_alloc = StrategyControl->numBufferAllocs;
> > > + StrategyControl->numBufferAllocs = 0;
> > > + }
> > > + if (num_buf_backend_clocksweep)
> > > + {
> > > + *num_buf_backend_clocksweep = StrategyControl->numBufferBackendClocksweep;
> > > + StrategyControl->numBufferBackendClocksweep = 0;
> > > + }
> > > + SpinLockRelease(&StrategyControl->freelist_lck);
> > > +
> > > + return;
> > > +}
> >
> > Do we need the if (num_buf_alloc) bits? Can't we make it unconditional?
>
> This API is more or less designed on lines of StrategySyncStart().
> Currently there is no case where we can't do it unconditional (same is
> true for num_buf_backend_clocksweep), however there is some merit
> to keep it in sync with existing API. Having said that if you feel we should
> go about doing it unconditionally, I will change it in next patch.

I have made it unconditional.

> Hm, right. But then let's move BgWriterStats.m_buf_alloc =+,
> ... pgstat_send_bgwriter(); into that loop. Otherwise it'd possibly end
> up being continously busy without being visible.

Done, and I have removed the pgstat_send_bgwriter() call from the
bgreclaimer loop, as calling it there becomes redundant after this change.

Apart from this, I have changed the oid of the newly added function, as the
oid I was using is no longer available due to a recent commit.

I have also taken tpc-b data. It is with the previous version of the patch;
however, the recent version hasn't changed enough to affect the performance
data. It takes a long time to gather tpc-b data, which is why I could not
report it previously. I will post data with the latest patch separately
(where I will focus on the new cases discussed between Robert and Andres).

TPC-B Performance Data
----------------------------------------
Configuration and Db Details
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
Database Locale = C
checkpoint_segments = 256
checkpoint_timeout = 15min
wal_buffers = 256MB
shared_buffers = 8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 30mins

Client_count/Patch_ver     32     64    128
HEAD                      420    535    556
Patch (sbe_v8)            431    537    568

About performance data
-------------------------------------
1. This data is a median of 3 runs; individual run data can be found in the
attached document (perf_read_scalability_data_v8.ods).
2. There is not much difference in performance between HEAD and the patch,
which shows that this patch hasn't regressed the tpc-b case.

Steps to take tpc-b data, for each run:
1. start server
2. drop db
3. create db
4. initialize db
5. run tpc-b pgbench (used prepared statement)
6. checkpoint
7. stop server

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
perf_read_scalability_data_v8.ods
Description: application/vnd.oasis.opendocument.spreadsheet
scalable_buffer_eviction_v9.patch
Description: Binary data