[PATCHES] [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

Martijn van Oosterhout Tue, 09 May 2006 13:37:40 -0700

This was a suggestion made back in March that would dramatically reduce
the overhead of EXPLAIN ANALYZE on queries that loop continuously over
the same nodes.


http://archives.postgresql.org/pgsql-hackers/2006-03/msg01114.php

What it does behave normally for the first 50 tuples of any node, but
after that it starts sampling at ever increasing intervals, the
intervals controlled by an exponential function. So for a node
iterating over 1 million tuples it takes around 15,000 samples. The
result is that EXPLAIN ANALYZE has a much reduced effect on the total
execution time.

Without EXPLAIN ANALYZE:

postgres=# select count(*) from generate_series(1,1000000);
  count  
---------
 1000000
(1 row)

Time: 2303.599 ms

EXPLAIN ANALYZE without patch:

postgres=# explain analyze select count(*) from generate_series(1,1000000);
                                                             QUERY PLAN         
                                                    
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15.00..15.01 rows=1 width=0) (actual time=8022.070..8022.073 
rows=1 loops=1)
   ->  Function Scan on generate_series  (cost=0.00..12.50 rows=1000 width=0) 
(actual time=1381.762..4873.369 rows=1000000 loops=1)
 Total runtime: 8042.472 ms
(3 rows)

Time: 8043.401 ms

EXPLAIN ANALYZE with patch:

postgres=# explain analyze select count(*) from generate_series(1,1000000);
                                                             QUERY PLAN         
                                                    
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15.00..15.01 rows=1 width=0) (actual time=2469.491..2469.494 
rows=1 loops=1)
   ->  Function Scan on generate_series  (cost=0.00..12.50 rows=1000 width=0) 
(actual time=1405.002..2187.417 rows=1000000 loops=1)
 Total runtime: 2496.529 ms
(3 rows)

Time: 2497.488 ms

As you can see, the overhead goes from 5.7 seconds to 0.2 seconds.
Obviously this is an extreme case, but it will probably help in a lot
of other cases people have been complaining about.

- To get this close it needs to get an estimate of the sampling overhead.
It does this by a little calibration loop that is run once per backend.
If you don't do this, you end up assuming all tuples take the same time
as tuples with the overhead, resulting in nodes apparently taking
longer than their parent nodes. Incidently, I measured the overhead to
be about 3.6us per tuple per node on my (admittedly slightly old)
machine.

Note that the resulting times still include the overhead actually
incurred, I didn't filter it out. I want the times to remain reflecting
reality as closely as possible.

- I also removed InstrStopNodeMulti and made InstrStopNode take a tuple
count parameter instead. This is much clearer all round.

- I also didn't make it optional. I'm unsure about whether it should be
optional or not, given the number of cases where it will make a
difference to be very few.

- The tuple counter for sampling restarts every loop. Thus a node that is
called repeatedly only returning one value each time will still measure
every tuple, though its parent node won't. We'll need some field
testing to see if that remains a significant effect.

- I don't let the user know anywhere how many samples it took. Is this
something users should care about?

Any comments?
-- 
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to 
> litigate.

Index: src/backend/commands/trigger.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/trigger.c,v
retrieving revision 1.201
diff -c -r1.201 trigger.c
*** src/backend/commands/trigger.c      27 Apr 2006 00:33:41 -0000      1.201
--- src/backend/commands/trigger.c      9 May 2006 20:35:07 -0000
***************
*** 1306,1312 ****
         * one "tuple returned" (really the number of firings).
         */
        if (instr)
!               InstrStopNode(instr + tgindx, true);
  
        return (HeapTuple) DatumGetPointer(result);
  }
--- 1306,1312 ----
         * one "tuple returned" (really the number of firings).
         */
        if (instr)
!               InstrStopNode(instr + tgindx, 1);
  
        return (HeapTuple) DatumGetPointer(result);
  }
***************
*** 2154,2160 ****
         * one "tuple returned" (really the number of firings).
         */
        if (instr)
!               InstrStopNode(instr + tgindx, true);
  }
  
  
--- 2154,2160 ----
         * one "tuple returned" (really the number of firings).
         */
        if (instr)
!               InstrStopNode(instr + tgindx, 1);
  }
  
  
Index: src/backend/executor/execProcnode.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/execProcnode.c,v
retrieving revision 1.54
diff -c -r1.54 execProcnode.c
*** src/backend/executor/execProcnode.c 5 Mar 2006 15:58:26 -0000       1.54
--- src/backend/executor/execProcnode.c 9 May 2006 20:35:07 -0000
***************
*** 423,429 ****
        }
  
        if (node->instrument)
!               InstrStopNode(node->instrument, !TupIsNull(result));
  
        return result;
  }
--- 423,429 ----
        }
  
        if (node->instrument)
!               InstrStopNode(node->instrument, TupIsNull(result) ? 0 : 1 );
  
        return result;
  }
Index: src/backend/executor/instrument.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/instrument.c,v
retrieving revision 1.14
diff -c -r1.14 instrument.c
*** src/backend/executor/instrument.c   5 Mar 2006 15:58:26 -0000       1.14
--- src/backend/executor/instrument.c   9 May 2006 20:35:07 -0000
***************
*** 14,22 ****
--- 14,84 ----
  #include "postgres.h"
  
  #include <unistd.h>
+ #include <math.h>
  
  #include "executor/instrument.h"
  
+ /* This is the function that is used to determine the sampling intervals. In
+  * general, if the function is f(x), then for N tuples we will take on the
+  * order of integral(1/f(x), x=0..N) samples. Some examples follow, with the
+  * number of samples that would be collected over 1,000,000 tuples.
+ 
+   f(x) = x         =>   log2(N)                        20
+   f(x) = x^(1/2)   =>   2 * N^(1/2)                  2000
+   f(x) = x^(1/3)   =>   1.5 * N^(2/3)               15000
+   
+  * I've chosen the last one as it seems to provide a good compromise between
+  * low overhead but still getting a meaningful number of samples. However,
+  * not all machines have the cbrt() function so on those we substitute
+  * sqrt(). The difference is not very significant in the tests I made.
+ */ 
+ #ifdef HAVE_CBRT
+ #define SampleFunc cbrt
+ #else
+ #define SampleFunc sqrt
+ #endif
+ 
+ #define SAMPLE_THRESHOLD 50
+ 
+ static double SampleOverhead;
+ static bool SampleOverheadCalculated;
+ 
+ static void
+ CalculateSampleOverhead()
+ {
+       Instrumentation instr;
+       int i;
+       
+       /* We want to determine the sampling overhead, to correct
+        * calculations later. This only needs to be done once per backend.
+        * Is this the place? A wrong value here (due to a mistimed
+        * task-switch) will cause bad calculations later.
+        *
+        * To minimize the risk we do it a few times and take the lowest.
+        */
+       
+       SampleOverhead = 1.0e6;
+       
+       for( i = 0; i<5; i++ )
+       {
+               int j;
+               double overhead;
+               
+               memset( &instr, 0, sizeof(instr) );
+       
+               /* Loop SAMPLE_THRESHOLD times or 100 microseconds, whichever 
is faster */
+               for( j=0; j<SAMPLE_THRESHOLD && 
INSTR_TIME_GET_DOUBLE(instr.counter) < 100e-6; i++ )
+               {
+                       InstrStartNode( &instr );
+                       InstrStopNode( &instr, 1 );
+               }
+               overhead = INSTR_TIME_GET_DOUBLE(instr.counter) / 
instr.samplecount;
+               if( overhead < SampleOverhead )
+                       SampleOverhead = overhead;
+       }
+               
+       SampleOverheadCalculated = true;
+ }
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
***************
*** 25,31 ****
        Instrumentation *instr = palloc0(n * sizeof(Instrumentation));
  
        /* we don't need to do any initialization except zero 'em */
! 
        return instr;
  }
  
--- 87,96 ----
        Instrumentation *instr = palloc0(n * sizeof(Instrumentation));
  
        /* we don't need to do any initialization except zero 'em */
!       
!       /* Calculate overhead, if not done yet */
!       if( !SampleOverheadCalculated )
!               CalculateSampleOverhead();
        return instr;
  }
  
***************
*** 34,82 ****
  InstrStartNode(Instrumentation *instr)
  {
        if (INSTR_TIME_IS_ZERO(instr->starttime))
!               INSTR_TIME_SET_CURRENT(instr->starttime);
        else
                elog(DEBUG2, "InstrStartNode called twice in a row");
  }
  
  /* Exit from a plan node */
  void
! InstrStopNode(Instrumentation *instr, bool returnedTuple)
  {
        instr_time      endtime;
  
        /* count the returned tuples */
!       if (returnedTuple)
!               instr->tuplecount += 1;
  
!       if (INSTR_TIME_IS_ZERO(instr->starttime))
        {
!               elog(DEBUG2, "InstrStopNode called without start");
!               return;
!       }
  
!       INSTR_TIME_SET_CURRENT(endtime);
  
  #ifndef WIN32
!       instr->counter.tv_sec += endtime.tv_sec - instr->starttime.tv_sec;
!       instr->counter.tv_usec += endtime.tv_usec - instr->starttime.tv_usec;
  
!       /* Normalize after each add to avoid overflow/underflow of tv_usec */
!       while (instr->counter.tv_usec < 0)
!       {
!               instr->counter.tv_usec += 1000000;
!               instr->counter.tv_sec--;
!       }
!       while (instr->counter.tv_usec >= 1000000)
!       {
!               instr->counter.tv_usec -= 1000000;
!               instr->counter.tv_sec++;
!       }
  #else                                                 /* WIN32 */
!       instr->counter.QuadPart += (endtime.QuadPart - 
instr->starttime.QuadPart);
  #endif
  
!       INSTR_TIME_SET_ZERO(instr->starttime);
  
        /* Is this the first tuple of this cycle? */
        if (!instr->running)
--- 99,167 ----
  InstrStartNode(Instrumentation *instr)
  {
        if (INSTR_TIME_IS_ZERO(instr->starttime))
!       {
!               /* We always sample the first SAMPLE_THRESHOLD tuples, so small 
nodes are always accurate */
!               if (instr->tuplecount < SAMPLE_THRESHOLD)
!                       instr->sampling = true;
!               else
!               {
!                       /* Otherwise we go to sampling, see the comments on 
SampleFunc at the top of the file */
!                       if( instr->tuplecount > instr->nextsample )
!                       {
!                               instr->sampling = true;
!                               /* The doubling is so the random will average 1 
over time */
!                               instr->nextsample += 2.0 * 
SampleFunc(instr->tuplecount) * (double)rand() / (double)RAND_MAX;
!                       }
!               }
!               if (instr->sampling)
!                       INSTR_TIME_SET_CURRENT(instr->starttime);
!       }
        else
                elog(DEBUG2, "InstrStartNode called twice in a row");
  }
  
  /* Exit from a plan node */
  void
! InstrStopNode(Instrumentation *instr, double nTuples)
  {
        instr_time      endtime;
  
        /* count the returned tuples */
!       instr->tuplecount += nTuples;
  
!       if (instr->sampling)
        {
!               if (INSTR_TIME_IS_ZERO(instr->starttime))
!               {
!                       elog(DEBUG2, "InstrStopNode called without start");
!                       return;
!               }
  
!               INSTR_TIME_SET_CURRENT(endtime);
  
  #ifndef WIN32
!               instr->counter.tv_sec += endtime.tv_sec - 
instr->starttime.tv_sec;
!               instr->counter.tv_usec += endtime.tv_usec - 
instr->starttime.tv_usec;
  
!               /* Normalize after each add to avoid overflow/underflow of 
tv_usec */
!               while (instr->counter.tv_usec < 0)
!               {
!                       instr->counter.tv_usec += 1000000;
!                       instr->counter.tv_sec--;
!               }
!               while (instr->counter.tv_usec >= 1000000)
!               {
!                       instr->counter.tv_usec -= 1000000;
!                       instr->counter.tv_sec++;
!               }
  #else                                                 /* WIN32 */
!               instr->counter.QuadPart += (endtime.QuadPart - 
instr->starttime.QuadPart);
  #endif
  
!               INSTR_TIME_SET_ZERO(instr->starttime);
!               instr->samplecount += nTuples;
!               instr->sampling = false;
!       }
  
        /* Is this the first tuple of this cycle? */
        if (!instr->running)
***************
*** 86,102 ****
        }
  }
  
- /* As above, but count multiple tuples returned at once */
- void
- InstrStopNodeMulti(Instrumentation *instr, double nTuples)
- {
-       /* count the returned tuples */
-       instr->tuplecount += nTuples;
- 
-       /* delegate the rest */
-       InstrStopNode(instr, false);
- }
- 
  /* Finish a run cycle for a plan node */
  void
  InstrEndLoop(Instrumentation *instr)
--- 171,176 ----
***************
*** 114,121 ****
        totaltime = INSTR_TIME_GET_DOUBLE(instr->counter);
  
        instr->startup += instr->firsttuple;
!       instr->total += totaltime;
        instr->ntuples += instr->tuplecount;
        instr->nloops += 1;
  
        /* Reset for next cycle (if any) */
--- 188,211 ----
        totaltime = INSTR_TIME_GET_DOUBLE(instr->counter);
  
        instr->startup += instr->firsttuple;
! 
!       /* Here we take into account sampling effects. Doing it naively ends
!        * up assuming the sampling overhead applies to all tuples, even the
!        * ones we didn't measure. We've calculated an overhead, so we
!        * subtract that for all samples we didn't measure. The first tuple
!        * is also special cased, because it usually takes longer. */
! 
!       if( instr->samplecount < instr->tuplecount )
!       {
!               double pertuple = (totaltime - instr->firsttuple) / 
(instr->samplecount - 1);
!               instr->total += instr->firsttuple + (pertuple * 
(instr->samplecount - 1))
!                                                 + ((pertuple - 
SampleOverhead) * (instr->tuplecount - instr->samplecount));
!       }
!       else
!               instr->total += totaltime;
!               
        instr->ntuples += instr->tuplecount;
+       instr->nsamples += instr->samplecount;
        instr->nloops += 1;
  
        /* Reset for next cycle (if any) */
***************
*** 123,127 ****
--- 213,218 ----
        INSTR_TIME_SET_ZERO(instr->starttime);
        INSTR_TIME_SET_ZERO(instr->counter);
        instr->firsttuple = 0;
+       instr->samplecount = 0;
        instr->tuplecount = 0;
  }
Index: src/backend/executor/nodeBitmapAnd.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeBitmapAnd.c,v
retrieving revision 1.6
diff -c -r1.6 nodeBitmapAnd.c
*** src/backend/executor/nodeBitmapAnd.c        5 Mar 2006 15:58:26 -0000       
1.6
--- src/backend/executor/nodeBitmapAnd.c        9 May 2006 20:35:07 -0000
***************
*** 163,169 ****
  
        /* must provide our own instrumentation support */
        if (node->ps.instrument)
!               InstrStopNodeMulti(node->ps.instrument, 0 /* XXX */ );
  
        return (Node *) result;
  }
--- 163,169 ----
  
        /* must provide our own instrumentation support */
        if (node->ps.instrument)
!               InstrStopNode(node->ps.instrument, 0 /* XXX */ );
  
        return (Node *) result;
  }
Index: src/backend/executor/nodeBitmapIndexscan.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeBitmapIndexscan.c,v
retrieving revision 1.17
diff -c -r1.17 nodeBitmapIndexscan.c
*** src/backend/executor/nodeBitmapIndexscan.c  5 Mar 2006 15:58:26 -0000       
1.17
--- src/backend/executor/nodeBitmapIndexscan.c  9 May 2006 20:35:07 -0000
***************
*** 112,118 ****
  
        /* must provide our own instrumentation support */
        if (node->ss.ps.instrument)
!               InstrStopNodeMulti(node->ss.ps.instrument, nTuples);
  
        return (Node *) tbm;
  }
--- 112,118 ----
  
        /* must provide our own instrumentation support */
        if (node->ss.ps.instrument)
!               InstrStopNode(node->ss.ps.instrument, nTuples);
  
        return (Node *) tbm;
  }
Index: src/backend/executor/nodeBitmapOr.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeBitmapOr.c,v
retrieving revision 1.5
diff -c -r1.5 nodeBitmapOr.c
*** src/backend/executor/nodeBitmapOr.c 5 Mar 2006 15:58:26 -0000       1.5
--- src/backend/executor/nodeBitmapOr.c 9 May 2006 20:35:07 -0000
***************
*** 179,185 ****
  
        /* must provide our own instrumentation support */
        if (node->ps.instrument)
!               InstrStopNodeMulti(node->ps.instrument, 0 /* XXX */ );
  
        return (Node *) result;
  }
--- 179,185 ----
  
        /* must provide our own instrumentation support */
        if (node->ps.instrument)
!               InstrStopNode(node->ps.instrument, 0 /* XXX */ );
  
        return (Node *) result;
  }
Index: src/backend/executor/nodeHash.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/nodeHash.c,v
retrieving revision 1.101
diff -c -r1.101 nodeHash.c
*** src/backend/executor/nodeHash.c     5 Mar 2006 15:58:26 -0000       1.101
--- src/backend/executor/nodeHash.c     9 May 2006 20:35:07 -0000
***************
*** 97,103 ****
  
        /* must provide our own instrumentation support */
        if (node->ps.instrument)
!               InstrStopNodeMulti(node->ps.instrument, hashtable->totalTuples);
  
        /*
         * We do not return the hash table directly because it's not a subtype 
of
--- 97,103 ----
  
        /* must provide our own instrumentation support */
        if (node->ps.instrument)
!               InstrStopNode(node->ps.instrument, hashtable->totalTuples);
  
        /*
         * We do not return the hash table directly because it's not a subtype 
of
Index: src/include/executor/instrument.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/executor/instrument.h,v
retrieving revision 1.13
diff -c -r1.13 instrument.h
*** src/include/executor/instrument.h   5 Mar 2006 15:58:56 -0000       1.13
--- src/include/executor/instrument.h   9 May 2006 20:35:09 -0000
***************
*** 61,77 ****
        instr_time      counter;                /* Accumulated runtime for this 
node */
        double          firsttuple;             /* Time for first tuple of this 
cycle */
        double          tuplecount;             /* Tuples emitted so far this 
cycle */
        /* Accumulated statistics across all completed cycles: */
        double          startup;                /* Total startup time (in 
seconds) */
        double          total;                  /* Total total time (in 
seconds) */
        double          ntuples;                /* Total tuples produced */
        double          nloops;                 /* # of run cycles for this 
node */
  } Instrumentation;
  
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
! extern void InstrStopNode(Instrumentation *instr, bool returnedTuple);
! extern void InstrStopNodeMulti(Instrumentation *instr, double nTuples);
  extern void InstrEndLoop(Instrumentation *instr);
  
  #endif   /* INSTRUMENT_H */
--- 61,81 ----
        instr_time      counter;                /* Accumulated runtime for this 
node */
        double          firsttuple;             /* Time for first tuple of this 
cycle */
        double          tuplecount;             /* Tuples emitted so far this 
cycle */
+       double          samplecount;            /* Samples collected this cycle 
*/
        /* Accumulated statistics across all completed cycles: */
        double          startup;                /* Total startup time (in 
seconds) */
        double          total;                  /* Total total time (in 
seconds) */
        double          ntuples;                /* Total tuples produced */
        double          nloops;                 /* # of run cycles for this 
node */
+       double          nsamples;               /* # of samples taken */
+       /* Tracking for sampling */
+       bool            sampling;               /* Are we sampling this 
iteration */
+       double          nextsample;             /* The next tuplecount we're 
going to sample */
  } Instrumentation;
  
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
! extern void InstrStopNode(Instrumentation *instr, double nTuples);
  extern void InstrEndLoop(Instrumentation *instr);
  
  #endif   /* INSTRUMENT_H */

signature.asc
Description: Digital signature

[PATCHES] [PATCH] Improve EXPLAIN ANALYZE overhead by sampling

Reply via email to