On 14.01.2025 17:49, Zhou, Zhiguo wrote:
Good day, Yura!
On 1/10/2025 8:42 PM, Yura Sokolov wrote:
If you consider the hash-table fill rate, then 256 is quite enough for 128
concurrent inserters.
The profile of your patch didn't show significant hotspots in the hash
table functions, so I believe the 256 entries should be enough.
I will soon try to evaluate the performance impact of your patch on
my device with the TPCC benchmark and also profile it to see if there
are any changes that could be made to further improve it.
It would be great. On my notebook (Mac Air M1) I don't see any
benefit from either my patch or yours ))
My colleague will also test it on a 20-core virtual machine (but
backported to v15).
I've tested the performance impact of our patches on an Intel Sapphire
Rapids device with 480 vCPUs using a HammerDB TPC-C workload (256 VUs).
The results show a 72.3% improvement (average of 3 rounds, RSD: 1.5%)
with your patch and a 76.0% boost (average of 3 rounds, RSD: 2.95%) with
mine, applied to the latest codebase. This optimization is most
effective on systems with over 64 cores, as our core-scaling experiments
suggest minimal impact on lower-core setups like your notebook or a 20-
core VM.
BTW, do you have a plan to merge this patch to the master branch?
Thanks!
I'm not a committer )) We will both struggle for many months to get
something committed ;-)
BTW, your version could use a similar trick for guaranteed atomicity:
- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
and store the offset to the previous record's start.
Since there are two limits:
#define XLogRecordMaxSize (1020 * 1024 * 1024)
#define WalSegMaxSize 1024 * 1024 * 1024
the offset to the previous record cannot be larger than 2GB.
Yes, it is a format change that some backup utilities will have to adapt to.
But it saves 4 bytes in XLogRecord (which could be spent to store a
FullTransactionId instead of a TransactionId), and it compresses better.
And your version then will not need to handle the case where this value
is split between two buffers (since MAXALIGN is not less than 4), and
PostgreSQL already relies on 4-byte read/write atomicity (in some places
even without using pg_atomic_uint32).
----
regards
Sokolov Yura aka funny-falcon
Thanks for the great suggestion!
I think we've arrived at a critical juncture where we need to decide
which patch to move forward with for our optimization efforts. I've
evaluated the pros and cons of my implementation:
Pros:
- Achieves an additional 4% performance improvement.
Cons:
- Breaks the current convention of XLog insertions.
- TAP tests do not fully pass yet and will require time to resolve.
- May necessitate changes to the WAL format and backup tools, potentially
leading to backward-compatibility issues.
Given these considerations, I believe your implementation is superior to
mine. I'd greatly appreciate it if you could share your insights on this
matter.
Good day, Zhiguo.
Excuse me, I feel a bit sneaky, but I've started another thread just
about increasing NUM_XLOGINSERT_LOCK, because I can measure its effect
even on my working notebook (it is a different one: Ryzen 5825U limited
to 2GHz).
http://postgr.es/m/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru
-----
regards
Yura Sokolov aka funny-falcon