On Mon, Oct 24, 2011 at 4:23 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> I wonder how trustworthy the measure of the visibilitymap_test call site
>>> as a consumer of cycles really is.
>
>> I'm not sure either.  I guess we could try short-circuiting
>> visibilitymap_test and see what that does to performance (let's leave
>> correct answers out of it).
>
> That would conflate the cost of the call with the cost of the function.
> Maybe you could try manually inlining the visibility test?
I fooled around with this some more on my Fedora 14 desktop machine: pgbench database, scale factor 20, shared_buffers=400MB. I ran the query "select count(*) from pgbench_accounts". I'm having a hard time getting completely reproducible results, but it appears that warming the cache makes the sequential scan drop from maybe 390 ms to 245 ms, while an index-only scan takes about 350 ms, so which one is better depends a lot on your assumptions about what else is going on on the system at the same time (which means maybe we ought not to sweat it).

If I wipe out the whole if-block that calls visibilitymap_test(), the index-only scan drops down to about 300 ms (and delivers potentially wrong answers, of course). Inlining visibilitymap_test (see attached vismap-inline.patch) causes the index-only scan to drop to about 330 ms.

I also tried changing the BufferIsValid() tests in visibilitymap_test() to use BufferIsInvalid() instead, with the sense of the tests reversed (see attached vismap-test-invalid.patch). Since BufferIsInvalid() just checks for InvalidBuffer instead of also doing the sanity checks, it's significantly cheaper. This also reduced the time to about 330 ms, so it seems clearly worth doing. Apparently these changes don't stack: doing both things only gets me down to about 320 ms, which is fairly unexciting for the amount of ugliness that inlining entails.

I tried sprinkling a little branch-prediction magic on this code using GCC's __builtin_expect(). That initially seemed to help, but once I changed the BufferIsValid() test to !BufferIsInvalid(), essentially all of the savings disappeared.

I also spent some time poking through the opannotate -s -a output, which shows where time is being spent per individual assembler instruction, but also annotates the assembler listing with the original source code.
At least according to oprofile, the time that is being spent in IndexOnlyNext() is mostly being spent on seemingly innocuous operations like saving and restoring registers. For example, much of the time attributed to the visibilitymap_test() call in IndexOnlyNext() is actually attributable to the instructions that calculate the address to pass for scandesc->heapRelation. Many (but not all) of the pointer dereferences at the top of IndexOnlyNext() have a chunk of cycles attributed to them, and while they're not that significant individually, they add up. Similarly, the long series of instructions to which index_getattr() resolves bleeds cycles at just about every step. There's not much to optimize there, though, unless we want to add some code that avoids decoding the tuple altogether in the particular case of a zero-argument aggregate, or maybe something more general that only pulls out the columns that are actually needed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers