Hi,
On Mon, 2 Mar 2026 at 22:55, Nathan Bossart <[email protected]> wrote:
On Wed, Feb 25, 2026 at 05:24:27PM +0300, Nazir Bilal Yavuz wrote:
If anyone has any suggestions/ideas, please let me know!
I am able to fix the problem. My first assumption was that the
branching of SIMD code caused that problem, so I moved SIMD code to
the CopyReadLineTextSIMDHelper() function. Then I moved this
CopyReadLineTextSIMDHelper() to top of CopyReadLineText(), by doing
that we won't have any branching in the non-SIMD (scalar) code path.
This didn't solve the problem and then I realized that even though I
disable SIMD code path with 'if (false)', there is still regression
but if I comment all of the 'if (cstate->simd_enabled)' branch, then
there is no regression at all.
To find out more, I compared assembly outputs of both and found out
the possible reason. What I understood is that the compiler can't
promote a variable to register, instead these variables live in the
stack; which is slower. Please see the two different assembly outputs:
Slow code:
c = copy_input_buf[input_buf_ptr++];
db0: 48 8b 55 b8 mov -0x48(%rbp),%rdx
db4: 48 63 c6 movslq %esi,%rax
db7: 44 8d 66 01 lea 0x1(%rsi),%r12d
dbb: 44 89 65 cc mov %r12d,-0x34(%rbp)
dbf: 0f be 14 02 movsbl (%rdx,%rax,1),%edx
Fast code:
c = copy_input_buf[input_buf_ptr++];
d80: 49 63 c4 movslq %r12d,%rax
d83: 45 8d 5c 24 01 lea 0x1(%r12),%r11d
d88: 41 0f be 04 06 movsbl (%r14,%rax,1),%eax
And the reason for that is sending the address of input_buf_ptr to a
CopyReadLineTextSIMDHelper(..., &input_buf_ptr). If I change it to
this:
int temp_input_buf_ptr = input_buf_ptr;
CopyReadLineTextSIMDHelper(..., &temp_input_buf_ptr);
Then there is no regression. However, I am still not completely sure
if that is the same problem in the v10, I am planning to spend more
time debugging this.
A couple of random ideas:
* Additional inlining for callers. I looked around a little bit and didn't
see any great candidates, so I don't have much faith in this, but maybe
you'll see something I don't.
I agree with you. CopyReadLineText() is already quite a big function.
* Disable SIMD if we are consistently getting small rows. That won't help
your "wide & CSV 1/3" case in all likelihood, but perhaps it'll help with
the regression for narrow rows described elsewhere.
I implemented this, two consecutive small rows disables SIMD.
* Surround the variable initializations with "if (simd_enabled)".
Presumably compilers are smart enough to remove those in the non-SIMD paths
already, but it could be worth a try.
Done.
* Add simd_enabled function parameter to CopyReadLine(),
NextCopyFromRawFieldsInternal(), and CopyFromTextLikeOneRow(), and do the
bool literal trick in CopyFrom{Text,CSV}OneRow(). That could encourage the
compiler to do some additional optimizations to reduce branching.
I think we don't need this. At least the implementation with
CopyReadLineTextSIMDHelper() doesn't need this since branching will be
at the top and it will be once per line.
I think v11 looks better compared to v10. I liked the
CopyReadLineTextSIMDHelper() helper function. I also liked it being at
the top of CopyReadLineText(), not being in the scalar path. This
gives us more optimization options without affecting the scalar path.
Here are the new benchmark results, I benchmarked the changes with
both -O2 and -O3 and also both with and without 'changing
default_toast_compression to lz4' commit (65def42b1d5). Benchmark
results show that there is no regression and the performance
improvement is much bigger with 65def42b1d5, it is close to 2x for
text format and more than 2x for the csv format.