On Wed, Sep 3, 2014 at 1:45 AM, Robert Haas <robertmh...@gmail.com> wrote:
>
> On Thu, Aug 28, 2014 at 7:11 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> > I have updated the patch to address the feedback.  Main changes are:
> >
> > 1. For populating freelist, have a separate process (bgreclaimer)
> > instead of doing it by bgwriter.
> > 2. Autotune the low and high threshold values for buffers
> > in freelist.  I have used the formula as suggested by you upthread.
> > 3. Cleanup of locking regimen as discussed upthread (completely
> > eliminated BufFreelist Lock).
> > 4. Improved comments and general code cleanup.
>
> Overall this looks quite promising to me.
>
> I had thought to call the new process just "bgreclaim" rather than
> "bgreclaimer", but perhaps your name is better after all.  At least,
> it matches what we do elsewhere.  But I don't care for the use
> "Bgreclaimer"; let's do "BgReclaimer" if we really need mixed-case, or
> else "bgreclaimer".

Changed it to bgreclaimer.

> This is unclear:
>
> +buffers for replacement.  Earlier to protect freelist, we use LWLOCK as that
> +is needed to perform clock sweep which is a longer operation, however now we
> +are using two spinklocks freelist_lck and victimbuf_lck to perform operations
> +on freelist and run clock sweep respectively.
>
> I would drop the discussion of what was done before and say something
> like this: The data structures relating to buffer eviction are
> protected by two spinlocks.  freelist_lck protects the freelist and
> related data structures, while victimbuf_lck protects information
> related to the current clock sweep condition.

Changed, but I have not used the exact wording mentioned above; let me
know if the new wording is okay.
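To make the intent of the two locks easier to see, the division of labour
is roughly as in the standalone sketch below.  This is only an
illustration, not the patch code: the real fields are slock_t members of
BufferStrategyControl taken with SpinLockAcquire()/SpinLockRelease(), and
the names freelist_pop(), clock_sweep_next() and get_victim_buffer() are
invented here for the example.

/*
 * Simplified model of the locking split: freelist_lck covers the freelist
 * and its counters, victimbuf_lck covers only the clock-sweep position.
 */
#include <pthread.h>

#define NBUFFERS 1024

typedef struct
{
    pthread_spinlock_t freelist_lck;    /* protects firstFreeBuffer and freeNext[] */
    int         firstFreeBuffer;        /* head of freelist, -1 if empty */
    int         freeNext[NBUFFERS];     /* singly-linked freelist */

    pthread_spinlock_t victimbuf_lck;   /* protects nextVictimBuffer only */
    int         nextVictimBuffer;       /* clock-sweep hand */
} StrategyControl;

/* Pop a buffer from the freelist; return -1 if it is empty. */
static int
freelist_pop(StrategyControl *sc)
{
    int         buf;

    pthread_spin_lock(&sc->freelist_lck);
    buf = sc->firstFreeBuffer;
    if (buf >= 0)
        sc->firstFreeBuffer = sc->freeNext[buf];
    pthread_spin_unlock(&sc->freelist_lck);
    return buf;
}

/* Advance the clock hand by one; the caller then checks pin/usage count. */
static int
clock_sweep_next(StrategyControl *sc)
{
    int         victim;

    pthread_spin_lock(&sc->victimbuf_lck);
    victim = sc->nextVictimBuffer;
    sc->nextVictimBuffer = (victim + 1) % NBUFFERS;
    pthread_spin_unlock(&sc->victimbuf_lck);
    return victim;
}

/* A backend first tries the freelist, then falls back to the clock sweep. */
static int
get_victim_buffer(StrategyControl *sc)
{
    int         buf = freelist_pop(sc);

    if (buf >= 0)
        return buf;
    return clock_sweep_next(sc);    /* real code loops until a usable buffer */
}

The point is that a backend that only pops from the freelist never
contends with one that is only advancing the clock hand.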
> +always in this list.  We also throw buffers into this list if we consider
> +their pages unlikely to be needed soon; this is done by background process
> +reclaimer.  The list is singly-linked using fields in the
>
> I suggest: Allocating pages from this list is much cheaper than
> running the "clock sweep" algorithm, which may encounter many buffers
> that are poor candidates for eviction before finding a good candidate.
> Therefore, we have a background process called bgreclaimer which works
> to keep this list populated.

Changed as per your suggestion.

> +Background Reclaimer's Processing
> +---------------------------------
>
> I suggest titling this section "Background Reclaim".
>
> +The background reclaimer is designed to move buffers to freelist that are
>
> I suggest replacing the first three words of this sentence with "bgreclaimer".

As per discussion in thread, I have kept it as it is.

> +and move the the unpinned and zero usage count buffers to freelist.  It
> +keep on doing this until the number of buffers in freelist become equal
> +high threshold of freelist.
>
> s/keep/keeps/
> s/become equal/reaches the/
> s/high threshold/high water mark/
> s/of freelist//

Changed as per your suggestion.

> Please change the other places that say threshold to use the "water
> mark" terminology.
>
> + if (StrategyMoveBufferToFreeListEnd (bufHdr))
>
> Extra space.
>
> + * buffers in consecutive cycles.
>
> s/consecutive/later/
>
> + /* Execute the LRU scan */
>
> s/LRU scan/clock sweep/ ?

Changed as per your suggestion.

> + while (tmp_num_to_free > 0)
>
> I am not sure it's a good idea for this value to be fixed at loop
> start and then just decremented.  Shouldn't we loop and do the whole
> thing over once we reach the high watermark, only stopping when
> StrategySyncStartAndEnd() says num_to_free is 0?

Okay, changed the loop as per your suggestion.
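For reference, the restructured loop in bgreclaimer now has roughly the
shape sketched below.  This is an illustrative sketch only, not the patch
code: BgReclaimerLoop(), ClockSweepNextBuffer() and BufferIsPinnedOrUsed()
are invented stand-ins for the real reclaimer code and buffer-header
checks, and the signature assumed for StrategySyncStartAndEnd() is a
guess.

#include <stdbool.h>

typedef struct BufferDesc BufferDesc;

/* assumed helpers; the real logic lives in freelist.c and the new reclaimer */
extern void StrategySyncStartAndEnd(int *num_to_free);
extern BufferDesc *ClockSweepNextBuffer(void);
extern bool BufferIsPinnedOrUsed(BufferDesc *buf);      /* refcount or usage_count > 0 */
extern bool StrategyMoveBufferToFreeListEnd(BufferDesc *buf);

static void
BgReclaimerLoop(void)
{
    int         num_to_free;

    for (;;)
    {
        /*
         * Re-read the shortfall each time around rather than fixing it once
         * and decrementing; if backends drain the freelist while we work,
         * we keep going until the high water mark is actually reached.
         */
        StrategySyncStartAndEnd(&num_to_free);
        if (num_to_free <= 0)
            break;              /* freelist is at its high water mark */

        while (num_to_free > 0)
        {
            BufferDesc *buf = ClockSweepNextBuffer();

            if (BufferIsPinnedOrUsed(buf))
                continue;       /* poor candidate; real code also decrements usage_count */

            if (StrategyMoveBufferToFreeListEnd(buf))
                num_to_free--;
        }
    }
}

So the number of buffers to free is recomputed from the freelist water
marks on each outer iteration, and the loop exits only once there is
nothing left to free.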
> + /* choose next victim buffer to clean. */
>
> This process doesn't clean buffers; it puts them on the freelist.

Right.  Changed it to match what it does.

> + * high threshold of freelsit), we drasticaly reduce the odds for
>
> Two typos.

Fixed.

> + * of buffers in freelist fall below low threshold of freelist.
>
> s/fall/falls/

Changed as per your suggestion.

> In freelist.c, it seems like a poor idea to have two spinlocks as
> consecutive structure members; they'll be in the same cache line,
> leading to false sharing.  If we merge them into a single spinlock,
> does that hurt performance?  If we put them further apart, e.g. by
> moving the freelist_lck to the start of the structure, followed by the
> latches, and leaving victimbuf_lck where it is, does that help
> performance?

As per discussion, I have kept them as they are and added a comment
indicating that we can consider having both locks in separate cache lines.

> + /*
> +  * If the buffer is pinned or has a nonzero usage_count, we cannot use
> +  * it; discard it and retry.  (This can only happen if VACUUM put a
> +  * valid buffer in the freelist and then someone else used it before
> +  * we got to it.  It's probably impossible altogether as of 8.3, but
> +  * we'd better check anyway.)
> +  */
> +
>
> This comment is clearly obsolete.

Removed.

> > I have not yet added statistics (buffers_backend_clocksweep) as
> > for that we need to add one more variable in BufferStrategyControl
> > structure where I have already added few variables for this patch.
> > I think it is important to have such a stat available via
> > pg_stat_bgwriter, but not sure if it is worth to make the structure
> > bit more bulky.
>
> I think it's worth it.

Okay, added the new statistic.

> > Another minor point is about changes in lwlock.h
> >
> > lwlock.h
> >  * if you remove a lock, consider leaving a gap in the numbering
> >  * sequence for the benefit of DTrace and other external debugging
> >  * scripts.
> >
> > As I have removed BufFreelist lock, I have adjusted the numbering
> > as well in lwlock.h.  There is a message on top of lock definitions
> > which suggest to leave gap if we remove any lock, however I was not
> > sure whether this case (removing the first element) can affect anything,
> > so for now, I have adjusted the numbering.
>
> Let's leave slot 0 unused, instead.

Sure, that makes sense.

Apart from the above, I think this patch requires a catversion bump, as I
have modified the system catalog.  However, I have not included that in
the patch, as otherwise it would be a bit difficult to take performance
data.

Performance Data with the updated patch

Configuration and DB Details
IBM POWER-7; 16 cores, 64 hardware threads
RAM = 64GB
Database Locale = C
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 8GB
scale factor = 3000
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5 mins

All the data is in tps and was taken using a pgbench read-only load.

Client Count/Patch_Ver (tps)      8      16      32      64     128
HEAD                          58614  107370  140717  104357   65010
Patch                         60092  113564  165014  213848  216065

This data is the median of 3 runs; a detailed report is attached to this
mail.  I have not repeated the test for all configurations, as there is no
major change in design/algorithm which can affect performance.  Mark has
already taken TPC-B data, which ensures that there is no problem with it;
however, I will also take it once more with the latest version.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
scalable_buffer_eviction_v6.patch
Description: Binary data
perf_read_scalability_data_v6.ods
Description: application/vnd.oasis.opendocument.spreadsheet