Hi,

On 2023-06-30 16:28:59 -0500, David Christensen wrote:
> On Fri, Jun 30, 2023 at 3:29 PM Andres Freund <and...@anarazel.de> wrote:
> >> And indeed. Comparing e.g. TPC-H, I see *massive* regressions.  Some 
> >> queries
> > are the same, sobut others regress by up to 70% (although more commonly 
> > around
> > 10-20%).
>
> Hmm, that is definitely not good.
>
> > That's larger than I thought, which makes me suspect that there's some bug 
> > in
> > the new code.
>
> Will do a little profiling here to see if I can figure out the
> regression. Which build optimization settings are you seeing this
> under?

gcc 12 with:

meson setup \
  -Doptimization=3 -Ddebug=true \
  -Dc_args="-ggdb -g3 -march=native -mtune=native -fno-plt 
-fno-semantic-interposition -Wno-array-bounds" \
  -Dc_link_args="-fuse-ld=mold -Wl,--gdb-index,--Bsymbolic" \
  ...

Relevant postgres settings:
-c huge_pages=on -c shared_buffers=12GB -c max_connections=120
-c work_mem=32MB
-c autovacuum=0 # I always do that for comparative benchmarks, too much variance
-c track_io_timing=on

The later run where I saw the smaller regression was with work_mem=1GB.  I
just had forgotten to adjust that.


I had loaded tpch scale 5 before, which is why I just used that.


FWIW, even just "SELECT count(*) FROM lineitem;" shows a substantial
regression.

I disabled parallelism, prewarmed the data and pinned postgres to a single
core to reduce noise. The result is the best of three (variance was low in all
cases).

                 HEAD       patch
index only scan  1896.364   2242.288
seq scan         1586.990   1628.042


A profile shows that 20% of the runtime in the IOS case is in
visibilitymap_get_status():

+   20.50%  postgres.new  postgres.new          [.] visibilitymap_get_status
+   19.54%  postgres.new  postgres.new          [.] ExecInterpExpr
+   14.47%  postgres.new  postgres.new          [.] IndexOnlyNext
+    6.47%  postgres.new  postgres.new          [.] index_deform_tuple_internal
+    4.67%  postgres.new  postgres.new          [.] ExecScan
+    4.12%  postgres.new  postgres.new          [.] btgettuple
+    3.97%  postgres.new  postgres.new          [.] ExecAgg
+    3.92%  postgres.new  postgres.new          [.] _bt_next
+    3.71%  postgres.new  postgres.new          [.] _bt_readpage
+    3.43%  postgres.new  postgres.new          [.] fetch_input_tuple
+    2.87%  postgres.new  postgres.new          [.] index_getnext_tid
+    2.45%  postgres.new  postgres.new          [.] MemoryContextReset
+    2.35%  postgres.new  postgres.new          [.] tts_virtual_clear
+    1.37%  postgres.new  postgres.new          [.] index_deform_tuple
+    1.14%  postgres.new  postgres.new          [.] ExecStoreVirtualTuple
+    1.13%  postgres.new  postgres.new          [.] PredicateLockPage
+    1.12%  postgres.new  postgres.new          [.] int8inc
+    1.04%  postgres.new  postgres.new          [.] ExecIndexOnlyScan
+    0.57%  postgres.new  postgres.new          [.] BufferGetBlockNumber

mostly due to

       │      BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
  2.46 │        lea    -0x60(,%rdx,4),%rcx
       │        xor    %edx,%edx
 59.79 │        div    %rcx


You can't have have divisions for this kind of thing in the vicinity of a
peformance critical path. With compile time constants the compiler can turn
this into shifts, but that's not possible as-is after the change.

While not quite as bad as divisions, the paths with multiplications are also
not going to be ok. E.g.
                return (Block) (BufferBlocks + ((Size) (buffer - 1)) * 
CLUSTER_BLOCK_SIZE);
is going to be noticeable.


You'd have to turn all of this into shifts (and enforce power of 2 sizes, if
you aren't yet).


I don't think pre-computed tables are a viable answer FWIW. Even just going
through a memory indirection is going to be noticable. This stuff is in a
crapton of hot code paths.

Greetings,

Andres Freund


Reply via email to