Hello. I ran Nazir's v11 patch on my x86 tower PC and my Arm Raspberry Pi with the same build setup I've been using: meson with "debugoptimized", which translates to the "-g -O2" gcc flags.
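
A note on reading the numbers below: the percentages are consistent with (old - new) / old * 100 against the old master timing, so a negative value is a regression. A minimal sketch of that arithmetic follows; pct_change() is a hypothetical name used only for illustration and is not part of the benchmark script.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the benchmark script): percent change of a
 * patched timing relative to the old-master timing.  Positive means the
 * patched build was faster (improvement), negative means it was slower
 * (regression).
 */
double
pct_change(double old_ms, double new_ms)
{
    assert(old_ms > 0.0);       /* avoid dividing by zero */
    return (old_ms - new_ms) / old_ms * 100.0;
}
```

For example, pct_change(25909.060500, 26416.331500) comes out at roughly -1.96, which matches the x86 NARROW v10 TXT line.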

x86 NARROW old master (18bcdb75)
TXT : 25909.060500 ms
CSV : 28137.591250 ms
TXT with 1/3 escapes: 27794.177000 ms
CSV with 1/3 quotes: 34541.704750 ms

x86 NARROW v10
TXT : 26416.331500 ms  -1.957890% regression
CSV : 25318.727500 ms  10.018142% improvement
TXT with 1/3 escapes: 28608.007500 ms  -2.928061% regression
CSV with 1/3 quotes: 32805.627750 ms  5.026032% improvement

x86 NARROW v11
TXT : 27212.945750 ms  -5.032545% regression
CSV : 26985.971250 ms  4.092817% improvement
TXT with 1/3 escapes: 27216.510000 ms  2.078374% improvement
CSV with 1/3 quotes: 32817.267500 ms  4.992334% improvement

x86 WIDE old master (18bcdb75)
TXT : 28778.426500 ms
CSV : 35671.908000 ms
TXT with 1/3 escapes: 32441.549750 ms
CSV with 1/3 quotes: 47024.416000 ms

x86 WIDE v10
TXT : 23067.046750 ms  19.846046% improvement
CSV : 23259.092250 ms  34.797174% improvement
TXT with 1/3 escapes: 31796.098250 ms  1.989583% improvement
CSV with 1/3 quotes: 42925.792250 ms  8.715948% improvement

x86 WIDE v11
TXT : 22571.305750 ms  21.568659% improvement
CSV : 22711.524750 ms  36.332184% improvement
TXT with 1/3 escapes: 29236.453000 ms  9.879604% improvement
CSV with 1/3 quotes: 40022.110750 ms  14.890786% improvement

arm NARROW old master (18bcdb75)
TXT : 10997.568250 ms
CSV : 10797.549000 ms
TXT with 1/3 escapes: 10299.047000 ms
CSV with 1/3 quotes: 12559.385750 ms

arm NARROW v10
TXT : 10467.816750 ms  4.816988% improvement
CSV : 9986.288000 ms  7.513381% improvement
TXT with 1/3 escapes: 10323.173750 ms  -0.234262% regression
CSV with 1/3 quotes: 11843.611750 ms  5.699116% improvement

arm NARROW v11
TXT : 10340.966250 ms  5.970429% improvement
CSV : 10224.399500 ms  5.308144% improvement
TXT with 1/3 escapes: 10438.216750 ms  -1.351288% regression
CSV with 1/3 quotes: 11865.934000 ms  5.521383% improvement

arm WIDE old master (18bcdb75)
TXT : 11825.771250 ms
CSV : 13907.074000 ms
TXT with 1/3 escapes: 13430.691250 ms
CSV with 1/3 quotes: 17557.954500 ms

arm WIDE v10
TXT : 9064.959000 ms  23.345727% improvement
CSV : 9019.553250 ms  35.144134% improvement
TXT with 1/3 escapes: 12344.497250 ms  8.087402% improvement
CSV with 1/3 quotes: 15495.863750 ms  11.744482% improvement

arm WIDE v11
TXT : 9001.442250 ms  23.882831% improvement
CSV : 8940.928750 ms  35.709490% improvement
TXT with 1/3 escapes: 12049.668500 ms  10.282589% improvement
CSV with 1/3 quotes: 15277.843250 ms  12.986201% improvement

Best,

-Manni

On Thu, Mar 5, 2026 at 3:25 PM Andrew Dunstan <[email protected]> wrote:

>
> On 2026-03-04 We 10:15 AM, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Mon, 2 Mar 2026 at 22:55, Nathan Bossart <[email protected]> wrote:
> >> On Wed, Feb 25, 2026 at 05:24:27PM +0300, Nazir Bilal Yavuz wrote:
> >>> If anyone has any suggestions/ideas, please let me know!
> > I am able to fix the problem. My first assumption was that the
> > branching of SIMD code caused that problem, so I moved SIMD code to
> > the CopyReadLineTextSIMDHelper() function. Then I moved this
> > CopyReadLineTextSIMDHelper() to top of CopyReadLineText(), by doing
> > that we won't have any branching in the non-SIMD (scalar) code path.
> > This didn't solve the problem and then I realized that even though I
> > disable SIMD code path with 'if (false)', there is still regression
> > but if I comment all of the 'if (cstate->simd_enabled)' branch, then
> > there is no regression at all.
> >
> > To find out more, I compared assembly outputs of both and found out
> > the possible reason. What I understood is that the compiler can't
> > promote a variable to register, instead these variables live in the
> > stack; which is slower.
> > Please see the two different assembly outputs:
> >
> > Slow code:
> >
> >     c = copy_input_buf[input_buf_ptr++];
> >      db0:    48 8b 55 b8        mov    -0x48(%rbp),%rdx
> >      db4:    48 63 c6           movslq %esi,%rax
> >      db7:    44 8d 66 01        lea    0x1(%rsi),%r12d
> >      dbb:    44 89 65 cc        mov    %r12d,-0x34(%rbp)
> >      dbf:    0f be 14 02        movsbl (%rdx,%rax,1),%edx
> >
> > Fast code:
> >
> >     c = copy_input_buf[input_buf_ptr++];
> >      d80:    49 63 c4           movslq %r12d,%rax
> >      d83:    45 8d 5c 24 01     lea    0x1(%r12),%r11d
> >      d88:    41 0f be 04 06     movsbl (%r14,%rax,1),%eax
> >
> > And the reason for that is sending the address of input_buf_ptr to a
> > CopyReadLineTextSIMDHelper(..., &input_buf_ptr). If I change it to
> > this:
> >
> >     int temp_input_buf_ptr = input_buf_ptr;
> >     CopyReadLineTextSIMDHelper(..., &temp_input_buf_ptr);
> >
> > Then there is no regression. However, I am still not completely sure
> > if that is the same problem in the v10, I am planning to spend more
> > time debugging this.
> >
> >> A couple of random ideas:
> >>
> >> * Additional inlining for callers.  I looked around a little bit and didn't
> >>   see any great candidates, so I don't have much faith in this, but maybe
> >>   you'll see something I don't.
> > I agree with you. CopyReadLineText() is already quite a big function.
> >
> >> * Disable SIMD if we are consistently getting small rows.  That won't help
> >>   your "wide & CSV 1/3" case in all likelihood, but perhaps it'll help with
> >>   the regression for narrow rows described elsewhere.
> > I implemented this, two consecutive small rows disables SIMD.
> >
> >> * Surround the variable initializations with "if (simd_enabled)".
> >>   Presumably compilers are smart enough to remove those in the non-SIMD paths
> >>   already, but it could be worth a try.
> > Done.
> >
> >> * Add simd_enabled function parameter to CopyReadLine(),
> >>   NextCopyFromRawFieldsInternal(), and CopyFromTextLikeOneRow(), and do the
> >>   bool literal trick in CopyFrom{Text,CSV}OneRow().
> >>   That could encourage the compiler to do some additional optimizations to
> >>   reduce branching.
> > I think we don't need this. At least the implementation with
> > CopyReadLineTextSIMDHelper() doesn't need this since branching will be
> > at the top and it will be once per line.
> >
> > I think v11 looks better compared to v10. I liked the
> > CopyReadLineTextSIMDHelper() helper function. I also liked it being at
> > the top of CopyReadLineText(), not being in the scalar path. This
> > gives us more optimization options without affecting the scalar path.
> >
> > Here are the new benchmark results, I benchmarked the changes with
> > both -O2 and -O3 and also both with and without 'changing
> > default_toast_compression to lz4' commit (65def42b1d5). Benchmark
> > results show that there is no regression and the performance
> > improvement is much bigger with 65def42b1d5, it is close to 2x for
> > text format and more than 2x for the csv format.
>
>
> I spent some time exploring different ideas for improving this, but
> found none that didn't cause regression in some cases, so good to go
> from my POV.
>
>
> cheers
>
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>

--
-- Manni Wood EDB: https://www.enterprisedb.com
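
P.S. A minimal sketch of the workaround Nazir describes in the quoted text: the helper gets the address of a temporary copy rather than &input_buf_ptr, so the index used in the scalar loop never has its address taken. fast_skip() and scan_line() are hypothetical stand-ins and not functions from the patch, and whether a given compiler then keeps input_buf_ptr in a register is not guaranteed; this only shows the shape of the change.

```c
/*
 * Hypothetical stand-in for CopyReadLineTextSIMDHelper(): advances *pos past
 * bytes that need no special handling.  It takes a pointer, so whatever
 * variable the caller passes has its address escape.
 */
void
fast_skip(const char *buf, int len, int *pos)
{
    while (*pos < len && buf[*pos] != '\n' && buf[*pos] != '\\')
        (*pos)++;
}

/*
 * Hypothetical caller: returns the number of bytes consumed up to and
 * including the first newline (or len if there is none).
 */
int
scan_line(const char *buf, int len)
{
    int     input_buf_ptr = 0;
    int     temp_input_buf_ptr = input_buf_ptr; /* only the copy escapes */

    fast_skip(buf, len, &temp_input_buf_ptr);
    input_buf_ptr = temp_input_buf_ptr;

    /* Scalar path: input_buf_ptr is a plain local here. */
    while (input_buf_ptr < len)
    {
        char    c = buf[input_buf_ptr++];

        if (c == '\n')
            break;
    }

    return input_buf_ptr;
}
```

Comparing the disassembly of the scalar loop with and without the temporary (as in the quoted slow/fast listings) is the way to check what a particular compiler and flag set actually does.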
