On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <and...@anarazel.de> wrote:
> As noted in my email from a few minutes ago, I agree that optimizing this
> shouldn't be a requirement for merging the patch.

Here's a new patch set. I think I've incorporated the performance fixes
that you've suggested so far into this version. I also adjusted a couple
of other things:

- After further study of a point previously raised by Amit, I adjusted
CreateCheckPoint slightly to call WALInsertLockAcquireExclusive
significantly later than before. I think that there's no real reason to
do it so early and that the current coding is probably just a historical
leftover, but it would be good to have some review here.

- I added a cross-check that when starting redo from a checkpoint whose
redo pointer points to an earlier LSN than the checkpoint itself, the
record we read from that LSN must be an XLOG_CHECKPOINT_REDO record.

- I combined what were previously 0002 and 0003 into a single patch,
since that's how this would get committed.

- I fixed up some comments.

- I updated commit messages.

Hopefully this is getting close to good enough.

> I did verify that they continue to be a bottleneck even after (incorrectly
> obviously), removing the spinlock. It's also not too surprising, the latency
> of 64bit divs is just high, particularly on intel from a few years ago (my
> cascade lake workstation) and IIRC there's just a single execution port for it
> too, so multiple instructions can't be fully parallelized.

The chipset on my laptop is even older. Coffee Lake, I think. I'm not
really sure that there's a whole lot we can reasonably do about the divs
unless you like the fastpath idea that I proposed earlier, or unless you
want to write a patch to either get rid of short page headers or make
long and short page headers the same number of bytes.

I have to admit I'm surprised by how visible the division overhead is in
this code path -- but I'm also somewhat inclined to view that less as
evidence that division is something we should be desperate to eliminate
and more as evidence that this code path is quite fast already. In light
of your findings, it doesn't seem completely impossible to me that the
speed of integer division in this code path could be part of what limits
performance for some users, but I'm also not sure it's all that likely
or all that serious, because we're deliberately creating test cases that
insert unreasonable amounts of WAL without doing any actual work. In the
real world, there's going to be a lot more other code running along with
this code - probably at least the executor and some heap AM code - and I
bet not all of that is as well-optimized as this is already. And it's
also quite likely for many users that the real limits on the speed of
the workload will be related to I/O or lock contention rather than CPU
cost in any form. I'm not saying it's not worth worrying about. I'm just
saying that we should make sure the amount of worrying we do is
calibrated to the true importance of the issue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
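P.S. For anyone following along, here is a rough sketch of the kind of
cross-check described in the second bullet above. This is not the code
from the attached patch; it assumes "record" is the XLogRecord header we
just read at checkPoint.redo, and the error wording is only a placeholder.

    /*
     * Illustrative sketch only, not the attached patch.  If the checkpoint's
     * redo pointer is earlier than the checkpoint record itself (i.e. it was
     * an online checkpoint), the record found at the redo LSN is expected to
     * be the XLOG_CHECKPOINT_REDO record inserted when the checkpoint began.
     */
    if (checkPoint.redo < CheckPointLoc)
    {
        uint8   info = record->xl_info & ~XLR_INFO_MASK;

        if (record->xl_rmid != RM_XLOG_ID || info != XLOG_CHECKPOINT_REDO)
            ereport(FATAL,
                    (errmsg("expected XLOG_CHECKPOINT_REDO record at redo location %X/%X",
                            LSN_FORMAT_ARGS(checkPoint.redo))));
    }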
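P.P.S. On the division point, a toy example (made-up names, not
PostgreSQL code) of why the 64-bit divides are hard to hide: a divide by
a value known only at run time has to go through the hardware DIV
instruction, which is what has the high latency Andres mentions, whereas
a divide by a compile-time power-of-two constant reduces to a shift.

    #include <stdint.h>

    /* run-time divisor: the compiler must emit a real 64-bit DIV */
    uint64_t
    div_by_runtime_value(uint64_t pos, uint64_t usable_bytes)
    {
        return pos / usable_bytes;
    }

    /* compile-time power-of-two divisor: compiles to a single shift */
    uint64_t
    div_by_constant_8192(uint64_t pos)
    {
        return pos / 8192;
    }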
v8-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patch
Description: Binary data
v8-0002-During-online-checkpoints-insert-XLOG_CHECKPOINT_.patch
Description: Binary data