On 2020/06/29 22:23, Ants Aasma wrote:
On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud <rjuju...@gmail.com> wrote:

    On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
     >
     > On 2020/06/29 16:05, Julien Rouhaud wrote:
     > > On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tha...@amazon.com> wrote:
     > >>
     > >> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
     >
     > Thanks for the benchmark!
     >
     >
     > >> ~45% performance drop [2] at high DB connection counts (when compared with v12.3)
     >
     > That's bad :(
     >
     >
     > >>
     > >> Disabling pg_stat_statements.track_planning (which is 'On' by default)
     > >> brings the TPS numbers up to v12.3 levels.
     > >>
     > >> The inflection point (in this test-case) is 128 Connections, beyond which the
     > >> TPS numbers are consistently low. Looking at the mailing list [1], this issue
     > >> didn't surface earlier possibly since the regression is trivial at low connection counts.
     > >>
     > >> It would be great if this could be optimized further, or track_planning
     > >> disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
     > >> enabled (but otherwise not particularly interested in track_planning).
     >
     > Your benchmark result seems to suggest that the cause of the problem is
     > the contention of per-query spinlock in pgss_store(). Right?
     > This lock contention is likely to happen when multiple sessions run
     > the same queries.
     >
     > One idea to reduce that lock contention is to separate per-query spinlock
     > into two; one is for planning, and the other is for execution. pgss_store()
     > determines which lock to use based on the given "kind" argument.
     > To make this idea work, every pgss counter like shared_blks_hit would also
     > need to be separated into two, i.e., one for planning and one for execution.
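
A rough sketch of that separation, purely for illustration (this is not the attached patch; the struct layout is hypothetical and assumes the existing pgssHashKey, Counters, slock_t, PGSS_NUMKIND, PGSS_PLAN and PGSS_EXEC definitions in pg_stat_statements.c):

/*
 * Illustrative only: counters and their locks duplicated per statistics
 * kind, so planning and execution updates contend on different locks.
 */
typedef struct pgssEntrySplit
{
    pgssHashKey key;                    /* hash key of entry - MUST BE FIRST */
    Counters    counters[PGSS_NUMKIND]; /* one counter set per kind */
    slock_t     mutex[PGSS_NUMKIND];    /* one spinlock per kind */
    Size        query_offset;           /* query text offset in external file */
    int         query_len;              /* # of valid bytes in query string, or -1 */
    int         encoding;               /* query text encoding */
} pgssEntrySplit;

/*
 * pgss_store() would then pick the lock and counter set from its "kind"
 * argument (PGSS_PLAN or PGSS_EXEC):
 *
 *      SpinLockAcquire(&e->mutex[kind]);
 *      ... update only e->counters[kind] here ...
 *      SpinLockRelease(&e->mutex[kind]);
 */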

    This can probably remove some overhead, but won't it eventually hit
    the same issue when multiple connections try to plan the same query,
    given the number of different queries and very low execution runtime?
    It'll also quite increase the shared memory consumption.

    I'm wondering if we could instead use atomics to store the counters.
    The only downside is that we won't guarantee per-row consistency
    anymore, which may be problematic.
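
For illustration, a minimal sketch of that atomics idea (again not the attached patch; it assumes the pg_atomic_* API from port/atomics.h and the surrounding pg_stat_statements.c code, ignores the double-valued counters such as total_time which would need a compare-and-swap loop, and the field selection is hypothetical):

#include "port/atomics.h"

/*
 * Illustrative only: integer counters become atomics and are bumped
 * without taking any per-entry lock.  Each field would be initialized
 * with pg_atomic_init_u64() in entry_alloc().
 */
typedef struct pgssAtomicCounters
{
    pg_atomic_uint64 calls;
    pg_atomic_uint64 rows;
    pg_atomic_uint64 shared_blks_hit;
    /* ... one atomic per remaining integer counter ... */
} pgssAtomicCounters;

static inline void
pgss_bump(pgssAtomicCounters *c, uint64 nrows, uint64 nhit)
{
    pg_atomic_fetch_add_u64(&c->calls, 1);
    pg_atomic_fetch_add_u64(&c->rows, nrows);
    pg_atomic_fetch_add_u64(&c->shared_blks_hit, nhit);
    /*
     * Each add is atomic on its own, but a concurrent reader can see
     * "calls" already bumped while "shared_blks_hit" is not, hence the
     * loss of per-row consistency mentioned above.
     */
}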



> The problem looks to be that spinlocks are terrible with overloaded CPU and a
> contended spinlock. A process holding the spinlock might easily get scheduled
> out leading to excessive spinning by everybody. I think a simple thing to try
> would be to replace the spinlock with LWLock.

Yes. Attached is a POC patch that replaces the per-counter spinlock with an LWLock.


> I did a prototype patch that replaces spinlocks with futexes, but was not able
> to find a workload where it mattered.

I'm not familiar with futexes, but could you tell me why you used a futex instead
of the LWLock that we already have? Are futexes portable?

> We have done a great job at eliminating spinlocks from contended code paths.
> Robins, perhaps you could try it to see if it reduces the regression you are
> observing.

Yes. We also need to check that this change doesn't increase performance
overhead in other workloads.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index cef8bb5a49..aa506f6c11 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -39,7 +39,7 @@
  * in an entry except the counters requires the same.  To look up an entry,
  * one must hold the lock shared.  To read or update the counters within
  * an entry, one must hold the lock shared or exclusive (so the entry doesn't
- * disappear!) and also take the entry's mutex spinlock.
+ * disappear!) and also take the entry's partition lock.
  * The shared state variable pgss->extent (the next free spot in the external
  * query-text file) should be accessed only while holding either the
  * pgss->mutex spinlock, or exclusive lock on pgss->lock.  We use the mutex to
@@ -115,6 +115,11 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 
 #define JUMBLE_SIZE                            1024    /* query serialization buffer size */
 
+#define        PGSS_NUM_LOCK_PARTITIONS()              (pgss_max)
+#define        PGSS_HASH_PARTITION_LOCK(key)   \
+       (&(pgss->base + \
+          (get_hash_value(pgss_hash, key) % PGSS_NUM_LOCK_PARTITIONS()))->lock)
+
 /*
  * Extension version number, for supporting older extension versions' objects
  */
@@ -207,7 +212,7 @@ typedef struct pgssEntry
        Size            query_offset;   /* query text offset in external file */
        int                     query_len;              /* # of valid bytes in query string, or -1 */
        int                     encoding;               /* query text encoding */
-       slock_t         mutex;                  /* protects the counters only */
+       LWLock          *lock;                  /* protects the counters only */
 } pgssEntry;
 
 /*
@@ -216,6 +221,7 @@ typedef struct pgssEntry
 typedef struct pgssSharedState
 {
        LWLock     *lock;                       /* protects hashtable search/modification */
+       LWLockPadded    *base;          /* base address of the LWLock array */
        double          cur_median_usage;       /* current median usage in hashtable */
        Size            mean_query_len; /* current mean entry text length */
        slock_t         mutex;                  /* protects following fields only: */
@@ -468,7 +474,8 @@ _PG_init(void)
         * resources in pgss_shmem_startup().
         */
        RequestAddinShmemSpace(pgss_memsize());
-       RequestNamedLWLockTranche("pg_stat_statements", 1);
+       RequestNamedLWLockTranche("pg_stat_statements",
+                                                         PGSS_NUM_LOCK_PARTITIONS() + 1);
 
        /*
         * Install hooks.
@@ -547,7 +554,8 @@ pgss_shmem_startup(void)
        if (!found)
        {
                /* First time through ... */
-               pgss->lock = &(GetNamedLWLockTranche("pg_stat_statements"))->lock;
+               pgss->base = GetNamedLWLockTranche("pg_stat_statements");
+               pgss->lock = &(pgss->base + pgss_max)->lock;
                pgss->cur_median_usage = ASSUMED_MEDIAN_INIT;
                pgss->mean_query_len = ASSUMED_LENGTH_INIT;
                SpinLockInit(&pgss->mutex);
@@ -1376,14 +1384,14 @@ pgss_store(const char *query, uint64 queryId,
        if (!jstate)
        {
                /*
-                * Grab the spinlock while updating the counters (see comment about
-                * locking rules at the head of the file)
+                * Grab the partition lock while updating the counters (see comment
+                * about locking rules at the head of the file)
                 */
                volatile pgssEntry *e = (volatile pgssEntry *) entry;
 
                Assert(kind == PGSS_PLAN || kind == PGSS_EXEC);
 
-               SpinLockAcquire(&e->mutex);
+               LWLockAcquire(e->lock, LW_EXCLUSIVE);
 
                /* "Unstick" entry if it was previously sticky */
                if (IS_STICKY(e->counters))
@@ -1435,7 +1443,7 @@ pgss_store(const char *query, uint64 queryId,
                e->counters.wal_fpi += walusage->wal_fpi;
                e->counters.wal_bytes += walusage->wal_bytes;
 
-               SpinLockRelease(&e->mutex);
+               LWLockRelease(e->lock);
        }
 
 done:
@@ -1762,9 +1770,9 @@ pg_stat_statements_internal(FunctionCallInfo fcinfo,
                {
                        volatile pgssEntry *e = (volatile pgssEntry *) entry;
 
-                       SpinLockAcquire(&e->mutex);
+                       LWLockAcquire(e->lock, LW_SHARED);
                        tmp = e->counters;
-                       SpinLockRelease(&e->mutex);
+                       LWLockRelease(e->lock);
                }
 
                /* Skip entry if unexecuted (ie, it's a pending "sticky" entry) */
@@ -1908,8 +1916,8 @@ entry_alloc(pgssHashKey *key, Size query_offset, int query_len, int encoding,
                memset(&entry->counters, 0, sizeof(Counters));
                /* set the appropriate initial usage count */
                entry->counters.usage = sticky ? pgss->cur_median_usage : USAGE_INIT;
-               /* re-initialize the mutex each time ... we assume no one using it */
-               SpinLockInit(&entry->mutex);
+               /* determine which partition lock to use for this query */
+               entry->lock = PGSS_HASH_PARTITION_LOCK(key);
                /* ... and don't forget the query text metadata */
                Assert(query_len >= 0);
                entry->query_offset = query_offset;
