On Sat, Jun 13, 2020 at 1:28 AM Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
> the resulting timing enough that the times aren't meaningful. E.g.
>
> CREATE TABLE lotsarows(key int not null);
> INSERT INTO lotsarows SELECT generate_series(1, 50000000);
> VACUUM FREEZE lotsarows;
>
> -- best of three:
> SELECT count(*) FROM lotsarows;
> Time: 1923.394 ms (00:01.923)
>
> -- best of three:
> EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
> Time: 2319.830 ms (00:02.320)
>
> -- best of three:
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 4202.649 ms (00:04.203)
>
> That nearly *doubles* the execution time compared to TIMING OFF.
>
> Looking at a profile of this shows that we spend a good bit of cycles
> "normalizing" timestamps etc. That seems pretty unnecessary, just forced
> on us due to struct timespec. So the first attached patch just turns
> instr_time into a 64-bit integer, counting nanoseconds.
>
> That helps, a tiny bit:
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 4179.302 ms (00:04.179)
>
> but obviously doesn't move the needle.
>
> Looking at a profile it's easy to confirm that we spend a lot of time
> acquiring the time:
>
> - 95.49%  0.00%  postgres  postgres  [.] agg_retrieve_direct (inlined)
>    - agg_retrieve_direct (inlined)
>       - 79.27% fetch_input_tuple
>          - ExecProcNode (inlined)
>             - 75.72% ExecProcNodeInstr
>                + 25.22% SeqNext
>                - 21.74% InstrStopNode
>                   + 17.80% __GI___clock_gettime (inlined)
>                - 21.44% InstrStartNode
>                   + 19.23% __GI___clock_gettime (inlined)
>                + 4.06% ExecScan
>       + 13.09% advance_aggregates (inlined)
>         1.06% MemoryContextReset
>
> And that's even though linux avoids a syscall (in most cases) etc. to
> acquire the time. Unless the kernel detects there's a reason not to do
> so, linux does this by using 'rdtscp' and multiplying it by
> kernel-provided factors to turn the cycles into time.
>
> Some of the time is spent doing function calls, dividing into struct
> timespec, etc.
> But most of it is just the rdtscp instruction:
>
>   65.30 │1  63:   rdtscp
>
> The reason for that is largely that rdtscp waits until all prior
> instructions have finished (but it allows later instructions to already
> start). Multiple times for each tuple.
>
> In the second attached prototype patch I've changed instr_time to count
> in CPU cycles instead of nanoseconds, and then just turned the cycles
> into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).
>
> When using rdtsc that results in *vastly* lower overhead:
>
> ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │                                                          QUERY PLAN                                                           │
> ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)                         │
> │   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1) │
> │ Planning Time: 0.028 ms                                                                                                       │
> │ Execution Time: 2610.256 ms                                                                                                   │
> └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> (4 rows)
>
> Time: 2610.589 ms (00:02.611)
>
> And there are still some smaller improvements that could be made on top
> of that.
>
> As a comparison, here's the time when using rdtscp directly in
> instr_time, instead of going through clock_gettime:
> Time: 3481.162 ms (00:03.481)
>
> That shows pretty well how big the cost of the added pipeline stalls is,
> and how important out-of-order execution is for decent performance...
>
> In my opinion, for the use in InstrStartNode(), InstrStopNode() etc., we
> do *not* want to wait for prior instructions to finish, since that
> actually leads to the timing being less accurate, rather than more.
> There are other cases where that'd be different, e.g.
> measuring how long an entire query takes or such (but there it's
> probably irrelevant which to use).
>
> I've above skipped a bit over the details of how to turn the cycles
> returned by rdtsc into time:
>
> On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
> have actually been run, but instead returns the number of 'reference
> cycles'. That's important because otherwise things like turbo mode and
> lower power modes would lead to completely bogus times.
>
> Thus, knowing the "base frequency" of the CPU allows us to turn the
> difference between two rdtsc return values into seconds.
>
> In the attached prototype I just determined the number of cycles using
> cpuid(0x16). That's only available since Skylake (I think). On older
> CPUs we'd have to look at /proc/cpuinfo or
> /sys/devices/system/cpu/cpu0/cpufreq/base_frequency.
>
> There are also other issues with using rdtsc directly: On older CPUs, in
> particular older multi-socket systems, the tsc will not be synchronized
> in detail across cores. There are bits that'd let us check whether the
> tsc is suitable or not. The more current issue is that things like
> virtual machines being migrated can lead to rdtsc suddenly returning a
> different value / the frequency differing. But that is supposed to be
> solved these days, by having virtualization technologies set frequency
> multipliers and offsets which then cause rdtsc[p] to return something
> meaningful, even after migration.
>
> The attached patches are really just a prototype. I'm also not really
> planning to work on getting this into a "production ready" patchset
> anytime soon. I developed it primarily because I found the overhead
> made it too hard to nail down in which part of a query tree performance
> changed. If somebody else wants to continue from here...
>
> I do think it'd be a pretty significant improvement if we could reduce
> the timing overhead of EXPLAIN ANALYZE by this much.
> Even if it requires a bunch of low-level code.

+1

Pavel

> Greetings,
>
> Andres Freund