On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas <robertmh...@gmail.com> wrote:
>
> On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> > I have updated the patch to address the feedback.  Main changes are:
> >
> > 1. For populating freelist, have a separate process (bgreclaimer)
> > instead of doing it by bgwriter.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist.  I have used the formula as suggested by you upthread.
> > 3. Cleanup of locking regimen as discussed upthread (completely
> > eliminated BufFreelist Lock).
> > 4. Improved comments and general code cleanup.
>
> Overall this looks quite promising to me.
>
> I had thought to call the new process just "bgreclaim" rather than
> "bgreclaimer", but perhaps your name is better after all.  At least,
> it matches what we do elsewhere.  But I don't care for the use
> "Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or
> else "bgreclaimer".

Changed it to bgreclaimer.

> This is unclear:
>
> +buffers for replacement.  Earlier to protect freelist, we use LWLOCK as that
> +is needed to perform clock sweep which is a longer operation, however now we
> +are using two spinklocks freelist_lck and victimbuf_lck to perform operations
> +on freelist and run clock sweep respectively.
>
> I would drop the discussion of what was done before and say something
> like this: The data structures relating to buffer eviction are
> protected by two spinlocks.  freelist_lck protects the freelist and
> related data structures, while victimbuf_lck protects information
> related to the current clock sweep condition.

Changed, but I have not used the exact wording mentioned above; let me
know if the new wording is okay.
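To make the intent of the two locks easier to see, the division of labour
is roughly as in the standalone sketch below.  This is only an
illustration, not the patch code: the real fields are slock_t members of
BufferStrategyControl taken with SpinLockAcquire()/SpinLockRelease(), and
the names freelist_pop(), clock_sweep_next() and get_victim_buffer() are
invented here for the example.

/*
 * Simplified model of the locking split: freelist_lck covers the freelist
 * and its counters, victimbuf_lck covers only the clock-sweep position.
 */
#include <pthread.h>

#define NBUFFERS 1024

typedef struct
{
    pthread_spinlock_t freelist_lck;    /* protects firstFreeBuffer and freeNext[] */
    int         firstFreeBuffer;        /* head of freelist, -1 if empty */
    int         freeNext[NBUFFERS];     /* singly-linked freelist */

    pthread_spinlock_t victimbuf_lck;   /* protects nextVictimBuffer only */
    int         nextVictimBuffer;       /* clock-sweep hand */
} StrategyControl;

/* Pop a buffer from the freelist; return -1 if it is empty. */
static int
freelist_pop(StrategyControl *sc)
{
    int         buf;

    pthread_spin_lock(&sc->freelist_lck);
    buf = sc->firstFreeBuffer;
    if (buf >= 0)
        sc->firstFreeBuffer = sc->freeNext[buf];
    pthread_spin_unlock(&sc->freelist_lck);
    return buf;
}

/* Advance the clock hand by one; the caller then checks pin/usage count. */
static int
clock_sweep_next(StrategyControl *sc)
{
    int         victim;

    pthread_spin_lock(&sc->victimbuf_lck);
    victim = sc->nextVictimBuffer;
    sc->nextVictimBuffer = (victim + 1) % NBUFFERS;
    pthread_spin_unlock(&sc->victimbuf_lck);
    return victim;
}

/* A backend first tries the freelist, then falls back to the clock sweep. */
static int
get_victim_buffer(StrategyControl *sc)
{
    int         buf = freelist_pop(sc);

    if (buf >= 0)
        return buf;
    return clock_sweep_next(sc);    /* real code loops until a usable buffer */
}

The point is that a backend that only pops from the freelist never
contends with one that is only advancing the clock hand.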
> +always in this list.  We also throw buffers into this list if we consider
> +their pages unlikely to be needed soon; this is done by background process
> +reclaimer.  The list is singly-linked using fields in the
>
> I suggest: Allocating pages from this list is much cheaper than
> running the "clock sweep" algorithm, which may encounter many buffers
> that are poor candidates for eviction before finding a good candidate.
> Therefore, we have a background process called bgreclaimer which works
> to keep this list populated.

Changed as per your suggestion.

> +Background Reclaimer's Processing
> +---------------------------------
>
> I suggest titling this section "Background Reclaim".
>
> +The background reclaimer is designed to move buffers to freelist that are
>
> I suggest replacing the first three words of this sentence with "bgreclaimer".

As per discussion in thread, I have kept it as it is.

> +and move the the unpinned and zero usage count buffers to freelist.  It
> +keep on doing this until the number of buffers in freelist become equal
> +high threshold of freelist.
>
> s/keep/keeps/
> s/become equal/reaches the/
> s/high threshold/high water mark/
> s/of freelist//

Changed as per your suggestion.

> Please change the other places that say threshold to use the "water
> mark" terminology.
>
> + if (StrategyMoveBufferToFreeListEnd (bufHdr))
>
> Extra space.
>
> + * buffers in consecutive cycles.
>
> s/consecutive/later/
>
> + /* Execute the LRU scan */
>
> s/LRU scan/clock sweep/ ?

Changed as per your suggestion.

> + while (tmp_num_to_free > 0)
>
> I am not sure it's a good idea for this value to be fixed at loop
> start and then just decremented.  Shouldn't we loop and do the whole
> thing over once we reach the high watermark, only stopping when
> StrategySyncStartAndEnd() says num_to_free is 0?

Okay, changed the loop as per your suggestion.
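For reference, the restructured loop in bgreclaimer now has roughly the
shape sketched below.  This is an illustrative sketch only, not the patch
code: BgReclaimerLoop(), ClockSweepNextBuffer() and BufferIsPinnedOrUsed()
are invented stand-ins for the real reclaimer code and buffer-header
checks, and the signature assumed for StrategySyncStartAndEnd() is a
guess.

#include <stdbool.h>

typedef struct BufferDesc BufferDesc;

/* assumed helpers; the real logic lives in freelist.c and the new reclaimer */
extern void StrategySyncStartAndEnd(int *num_to_free);
extern BufferDesc *ClockSweepNextBuffer(void);
extern bool BufferIsPinnedOrUsed(BufferDesc *buf);      /* refcount or usage_count > 0 */
extern bool StrategyMoveBufferToFreeListEnd(BufferDesc *buf);

static void
BgReclaimerLoop(void)
{
    int         num_to_free;

    for (;;)
    {
        /*
         * Re-read the shortfall each time around rather than fixing it once
         * and decrementing; if backends drain the freelist while we work,
         * we keep going until the high water mark is actually reached.
         */
        StrategySyncStartAndEnd(&num_to_free);
        if (num_to_free <= 0)
            break;              /* freelist is at its high water mark */

        while (num_to_free > 0)
        {
            BufferDesc *buf = ClockSweepNextBuffer();

            if (BufferIsPinnedOrUsed(buf))
                continue;       /* poor candidate; real code also decrements usage_count */

            if (StrategyMoveBufferToFreeListEnd(buf))
                num_to_free--;
        }
    }
}

So the number of buffers to free is recomputed from the freelist water
marks on each outer iteration, and the loop exits only once there is
nothing left to free.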
> + /* choose next victim buffer to clean. */
>
> This process doesn't clean buffers; it puts them on the freelist.

Right.  Changed it to match what it does.

> + * high threshold of freelsit), we drasticaly reduce the odds for
>
> Two typos.

Fixed.

> + * of buffers in freelist fall below low threshold of freelist.
>
> s/fall/falls/

Changed as per your suggestion.

> In freelist.c, it seems like a poor idea to have two spinlocks as
> consecutive structure members; they'll be in the same cache line,
> leading to false sharing.  If we merge them into a single spinlock,
> does that hurt performance?  If we put them further apart, e.g. by
> moving the freelist_lck to the start of the structure, followed by the
> latches, and leaving victimbuf_lck where it is, does that help
> performance?

As per discussion, I have kept them as they are and added a comment
indicating that we can consider having both locks in separate cache lines.

> + /*
> +  * If the buffer is pinned or has a nonzero usage_count, we cannot use
> +  * it; discard it and retry.  (This can only happen if VACUUM put a
> +  * valid buffer in the freelist and then someone else used it before
> +  * we got to it.  It's probably impossible altogether as of 8.3, but
> +  * we'd better check anyway.)
> +  */
> +
>
> This comment is clearly obsolete.

Removed.

> > I have not yet added statistics (buffers_backend_clocksweep) as
> > for that we need to add one more variable in BufferStrategyControl
> > structure where I have already added few variables for this patch.
> > I think it is important to have such a stat available via
> > pg_stat_bgwriter, but not sure if it is worth to make the structure
> > bit more bulky.
>
> I think it's worth it.

Okay, added the new statistic.

> > Another minor point is about changes in lwlock.h
> >
> > lwlock.h
> >  * if you remove a lock, consider leaving a gap in the numbering
> >  * sequence for the benefit of DTrace and other external debugging
> >  * scripts.
> >
> > As I have removed BufFreelist lock, I have adjusted the numbering
> > as well in lwlock.h.  There is a message on top of lock definitions
> > which suggest to leave gap if we remove any lock, however I was not
> > sure whether this case (removing the first element) can affect anything,
> > so for now, I have adjusted the numbering.
>
> Let's leave slot 0 unused, instead.

Sure, that makes sense.

Apart from the above, I think this patch requires a catversion bump, as I
have modified the system catalog.  However, I have not included that in
the patch, as otherwise it would be a bit difficult to take performance
data.

Performance Data with the updated patch

Configuration and DB Details
IBM POWER-7; 16 cores, 64 hardware threads
RAM = 64GB
Database Locale = C
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5 mins

All the data is in tps and was taken using a pgbench read-only load.

Client Count/Patch_Ver (tps)      8      16      32      64     128
HEAD                          58614  107370  140717  104357   65010
Patch                         60092  113564  165014  213848  216065

This data is the median of 3 runs; a detailed report is attached to this
mail.  I have not repeated the test for all configurations, as there is no
major change in design/algorithm which can affect performance.  Mark has
already taken TPC-B data, which ensures that there is no problem with it;
however, I will also take it once more with the latest version.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
scalable_buffer_eviction_v6.patch
Description: Binary data
perf_read_scalability_data_v6.ods
Description: application/vnd.oasis.opendocument.spreadsheet