I feel like we must be missing something here.  If Dilip is seeing
huge speedups and you're seeing nothing, something is different, and
we don't know what it is.  Even if the test case is artificial, it
ought to be the same when one of you runs it as when the other runs
it.  Right?

Yes, definitely - we're missing something important, I think. One difference
is that Dilip is using longer runs, but I don't think that's the problem (I've
already demonstrated how stable the results are).

It's not impossible that the longer runs could matter - performance
isn't necessarily stable across time during a pgbench test, and the
longer the run the more CLOG pages it will fill.

Sure, but I'm not doing just a single pgbench run. I do a sequence of pgbench runs, with different client counts, with ~6h of total runtime. There's a checkpoint between the runs, but since these benchmarks use unlogged tables, it flushes only a few buffers.

Also, the clog SLRU has 128 pages, which is ~1MB of clog data, i.e. ~4M transactions. On some kernels (3.10 and 3.12) I can get >50k tps with 64 clients or more, which means we fill the 128 pages in less than 80 seconds.
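
For reference, the arithmetic above can be checked with a short sketch. It assumes the default 8kB block size and 2 status bits per transaction in the clog; the function names are made up for illustration and are not PostgreSQL APIs.

```python
# Back-of-the-envelope check of the clog SLRU numbers quoted above.
# Assumptions: default 8kB block size, 2 commit-status bits per transaction.
BLCKSZ = 8192            # bytes per page (assumed default build setting)
CLOG_BITS_PER_XACT = 2   # status bits stored per transaction
SLRU_PAGES = 128         # clog SLRU size mentioned in the message


def xacts_per_page(block_size=BLCKSZ, bits_per_xact=CLOG_BITS_PER_XACT):
    """Number of transaction statuses that fit on one clog page."""
    return block_size * 8 // bits_per_xact


def slru_capacity_xacts(pages=SLRU_PAGES):
    """Number of transactions covered by the whole clog SLRU."""
    return pages * xacts_per_page()


def seconds_to_fill(tps, pages=SLRU_PAGES):
    """Time to consume one SLRU's worth of transaction IDs at a given rate."""
    return slru_capacity_xacts(pages) / tps


if __name__ == "__main__":
    print(SLRU_PAGES * BLCKSZ)               # 1048576 bytes, i.e. 1MB
    print(slru_capacity_xacts())             # 4194304, i.e. ~4M transactions
    print(f"{seconds_to_fill(50_000):.1f}")  # 83.9 seconds at exactly 50k tps
```

At exactly 50k tps this gives roughly 84 seconds, so filling the SLRU in under 80 seconds corresponds to rates above about 52k tps.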

So half-way through the run only 50% of the clog pages fit into the SLRU, and we have a data set with 30M tuples, with uniform random access - so it seems rather unlikely we'll hit a transaction that's still in the SLRU.

But sure, I can do a run with larger data set to verify this.

I wonder which CPU model Dilip is using - I know it's x86, but not which
generation. I'm using an E5-4620 v1 Xeon; perhaps Dilip is using a newer
model and it makes a difference (although that seems unlikely).

The fact that he's using an 8-socket machine seems more likely to
matter than the CPU generation, which isn't much different.  Maybe
Dilip should try this on a 2-socket machine and see if he sees the
same kinds of results.

Maybe. I wouldn't expect a major difference between 4 and 8 sockets, but I may be wrong.


