On Sun, Dec 3, 2023 at 11:22 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
>
> Thanks for the script. Are you also measuring the time it takes to
> decode this using test_decoding?
>
> FWIW I did more comprehensive suite of tests over the weekend, with a
> couple more variations. I'm attaching the updated scripts, running it
> should be as simple as
>
>   ./run.sh BRANCH TRANSACTIONS RUNS
>
> so perhaps
>
>   ./run.sh master 1000 3
>
> to do 3 runs with 1000 transactions per client. And it'll run a bunch of
> combinations hard-coded in the script, and write the timings into a CSV
> file (with "master" in each row).
>
> I did this on two machines (i5 with 4 cores, xeon with 16/32 cores). I
> did this with current master, the basic patch (without the 0002 part),
> and then with the optimized approach (single global hash table, see the
> 0004 part). That's what master / patched / optimized in the results is.
>
> Interestingly enough, the i5 handled this much faster, it seems to be
> better in single-core tasks. The xeon is still running, so the results
> for "optimized" only have one run (out of 3), but shouldn't change much.
>
> Attached is also a table summarizing this, and visualizing the timing
> change (vs. master) in the last couple columns. Green is "faster" than
> master (but we don't really expect that), and "red" means slower than
> master (the more red, the slower).
>
> There results are grouped by script (see the attached .tgz), with either
> 32 or 96 clients (which does affect the timing, but not between master
> and patch). Some executions have no pg_sleep() calls, some have 0.001
> wait (but that doesn't seem to make much difference).
>
> Overall, I'd group the results into about three groups:
>
> 1) good cases [nextval, nextval-40, nextval-abort]
>
> These are cases that slow down a bit, but the slowdown is mostly within
> reasonable bounds (we're making the decoding to do more stuff, so it'd
> be a bit silly to require that extra work to make no impact). And I do
> think this is reasonable, because this is pretty much an extreme / worst
> case behavior. People don't really do just nextval() calls, without
> doing anything else. Not to mention doing aborts for 100% transactions.
>
> So in practice this is going to be within noise (and in those cases the
> results even show speedup, which seems a bit surprising). It's somewhat
> dependent on CPU too - on xeon there's hardly any regression.
>
>
> 2) nextval-40-abort
>
> Here the slowdown is clear, but I'd argue it generally falls in the same
> group as (1). Yes, I'd be happier if it didn't behave like this, but if
> someone can show me a practical workload affected by this ...
>
>
> 3) irrelevant cases [all the alters taking insane amounts of time]
>
> I absolutely refuse to care about these extreme cases where decoding
> 100k transactions takes 5-10 minutes (on i5), or up to 30 minutes (on
> xeon). If this was a problem for some practical workload, we'd have
> already heard about it I guess. And even if there was such workload, it
> wouldn't be up to this patch to fix that. There's clearly something
> misbehaving in the snapshot builder.
>
>
> I was hopeful the global hash table would be an improvement, but that
> doesn't seem to be the case. I haven't done much profiling yet, but I'd
> guess most of the overhead is due to ReorderBufferQueueSequence()
> starting and aborting a transaction in the non-transactinal case. Which
> is unfortunate, but I don't know if there's a way to optimize that.
>
Before discussing the alternative ideas you shared, let me try to
clarify my understanding so that we are on the same page. Based on the
testing and discussion so far, I see two observations: (a) for
non-transactional cases, the overhead observed is mainly due to
starting/aborting a transaction for each change; (b) for transactional
cases, we see overhead due to traversing all the top-level txns and
checking the hash table for each one to find whether the change is
transactional. Am I missing something?

--
With Regards,
Amit Kapila.