Re: Speed up COPY FROM text/CSV parsing using SIMD

Manni Wood Tue, 20 Jan 2026 12:49:29 -0800

On Sat, Jan 17, 2026 at 3:25 PM KAZAR Ayoub wrote:


> Hello,
> Thank you for these benchmarks; this is helpful!
>
> On Wed, Jan 14, 2026 at 1:20 AM Manni Wood wrote:
>
>>
>>
>> Hello!
>>
>> Nazir, I'm glad you are finding the benchmarks useful. I have more! :-)
>>
>> All of these benchmarks are all-in-RAM, because I do think that is the
>> best way of getting closest to the theoretical best and worst case
>> scenarios.
>>
>> My laptop:
>>
>> master: (852558b9)
>>
>> text, no special: 14996
>> text, 1/3 special: 17270
>> csv, no special: 18274
>> csv, 1/3 special: 23852
>>
>> v3
>>
>> text, no special: 11282 (24.7% speedup)
>> text, 1/3 special: 15748 (8.8% speedup) <-- I don't believe this but it's
>> what I got
>> csv, no special: 11571 (36.6% speedup)
>> csv, 1/3 special: 19934 (16.4% speedup) <-- I don't believe this but it's
>> what I got
>>
>> v4.2
>>
>> text, no special: 11139 (25.7% speedup)
>> text, 1/3 special: 18900 (9.4% regression)
>> csv, no special: 11490 (37.1% speedup)
>> csv, 1/3 special: 26134 (9.5% regression)
>>
>> An AWS EC2 t2.2xlarge instance
>>
>> master: (852558b9)
>>
>> text, no special: 20677
>> text, 1/3 special: 22660
>> csv, no special: 24534
>> csv, 1/3 special: 30999
>>
>> v3
>>
>> text, no special: 17534 (15.2% speedup)
>> text, 1/3 special: 22816 (0.6% regression)
>> csv, no special: 17664 (28.0% speedup)
>> csv, 1/3 special: 29338 (5.3% speedup) <-- I don't believe this but it's
>> what I got
>>
>> v4.2
>>
>> text, no special: 17459 (15.5% speedup)
>> text, 1/3 special: 25051 (10.5% regression)
>> csv, no special: 17574 (28.3% speedup)
>> csv, 1/3 special: 32092 (3.5% regression)
>>
>> An AWS EC2 t4g.2xlarge instance (using ARM processor; first test of ARM
>> processor!)
>>
>> master: (852558b9)
>>
>> text, no special: 22081
>> text, 1/3 special: 25100
>> csv, no special: 27296
>> csv, 1/3 special: 32344
>>
>> v3
>>
>> text, no special: 17724 (19.7% speedup)
>> text, 1/3 special: 27606 (9.9% regression) <-- yikes! We would want to
>> test this more
>> csv, no special: 17597 (35.5% speedup)
>> csv, 1/3 special: 32597 (0.8% regression)
>>
>> v4.2
>>
>> text, no special: 17674 (20% speedup)
>> text, 1/3 special: 25773 (2.6% regression) <-- this regression is less
>> than for the v3 patch? Atypical...
>> csv, no special: 17651 (35.3% speedup)
>> csv, 1/3 special: 34055 (5.3% regression)
>>
>> Yes, I think I agree with you that the everything-in-RAM benchmarks will
>> make the regressions more pronounced, just like the everything-in-RAM
>> benchmarks make the improvements more pronounced.
>>
>> I am not sure why the CSV regression is usually worse than the TXT
>> regression (even for the v3 patch, which has smaller regressions than the
>> v4.2 patch). I probably should look over some flame graphs and see if I
>> can find the place where the CSV-parsing code is so much slower. The CSV
>> regression is actually a bit frustrating (at around 5%) because the TXT
>> regression, at less than 1% (for the v3 patch), is so much easier to bear.
>>
> The only reason I can think of for this problem is that the CSV state
> machine is more complex than the TEXT one, so everything related to branch
> prediction, stalls, etc. becomes more demanding in CSV mode. I saw this in
> earlier tight micro-benchmarks on CopyReadLineText: CSV mode has slightly
> lower IPC, more branch misses, and more stalls. I assume instruction cache
> misses shouldn't be a problem, since the generated code duplicates the
> scalar path, and the code isn't that large for one core's instruction
> cache anyway (mine has 5.8KB per core).
> We can use perf_event_open[1] around CopyReadLine to see exactly what's
> going on with the counters, if someone wants to confirm.
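
To make the state-machine point above concrete, here is a toy Python sketch
(not the PostgreSQL code) of finding the end of a line in each format. It
assumes the default CSV setup where the quote and escape characters are both
the double quote; the function names are made up for illustration. Text mode
only needs to skip one character after a backslash, while CSV mode has to
carry an in-quote flag across every character, which is the kind of extra
data-dependent branching that could plausibly show up as branch misses.

```python
def text_line_end(buf: str) -> int:
    """Return the index of the newline that ends the first line, or -1.

    Text-format rule modeled here: a backslash escapes the next character,
    so an escaped newline does not end the line.
    """
    i = 0
    while i < len(buf):
        if buf[i] == "\\":
            i += 2          # skip the backslash and the character it escapes
        elif buf[i] == "\n":
            return i
        else:
            i += 1
    return -1


def csv_line_end(buf: str, quote: str = '"') -> int:
    """Return the index of the newline that ends the first record, or -1.

    CSV rule modeled here: a newline inside a quoted field is data. Because
    the escape character is assumed equal to the quote character, a doubled
    quote simply toggles the flag twice.
    """
    in_quote = False        # state carried from one character to the next
    for i, c in enumerate(buf):
        if c == quote:
            in_quote = not in_quote
        elif c == "\n" and not in_quote:
            return i
    return -1
```

In the text case the decision at each character depends only on that
character, whereas in the CSV case the meaning of a newline depends on every
quote seen before it.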
>
>>
>>
>
>> Here are some copy-to benchmarks for the v4 patch that applies SIMD to
>> the copy-to code.
>>
>> These were all-in-RAM tests.
>>
>> My laptop
>>
>> master: (852558b9)
>>
>> text, no special: 2948
>> text, 1/3 special: 11258
>> csv, no special: 6245
>> csv, 1/3 special: 11258
>>
>> v4 (copy to)
>>
>> text, no special: 2126 (27.9% speedup)
>> text, 1/3 special: 12080 (7.3% regression) <-- did not see such a big
>> regression before
>> csv, no special: 2432 (61.0% speedup)
>> csv, 1/3 special: 12344 (4.0% regression) <-- did not see such a big
>> regression before
>>
>> An AWS EC2 t2.2xlarge instance
>>
>> master: (852558b9)
>>
>> text, no special: 4647
>> text, 1/3 special: 13865
>> csv, no special: 5421
>> csv, 1/3 special: 15284
>>
>> v4 (copy to)
>>
>> text, no special: 2460 (47.0% speedup)
>> text, 1/3 special: 14023 (1.1% regression)
>> csv, no special: 2667 (50.7% speedup)
>> csv, 1/3 special: 15251 (0.2% speedup)
>>
>> An AWS EC2 t4g.2xlarge instance (using ARM processor; first test of ARM
>> processor!)
>>
>> master: (852558b9)
>>
>> text, no special: 6951
>> text, 1/3 special: 17857
>> csv, no special: 7951
>> csv, 1/3 special: 18504
>>
>> v4 (copy to)
>>
>> text, no special: 3372 (51.4% speedup)
>> text, 1/3 special: 15713 (12.0% speedup)
>> csv, no special: 3233 (59.3% speedup)
>> csv, 1/3 special: 1622 (12.3% speedup)
>>
>> Once again, the v4 patch for copy-to seems like a clearer win, though, to
>> be fair, there were regressions when running on my laptop. (I'm starting to
>> think servers or desktops are better than laptops for testing these things,
>> though maybe that's my bias: it just seems like the server results are
>> always less surprising.)
>>
> Indeed, I agree on this too.
>
> [1]https://man7.org/linux/man-pages/man2/perf_event_open.2.html
>
> Regards,
> Ayoub Kazar
>

Hello, all. I have more benchmarks.

These benchmarks are from a Raspberry Pi 5 that I bought. It has an Arm
Cortex-A76 processor.

(I was so impressed with the stability of the results I got on my
standalone Intel tower PC that I figured I needed a standalone Arm-based
machine that was not a laptop and not a VM at a cloud service provider. The
run-to-run results were indeed more stable, just like with my standalone
tower PC.)

COPY FROM

master: (852558b9)

text, no special: 9111
text, 1/3 special: 10302
csv, no special: 11147
csv, 1/3 special: 13375

v3

text, no special: 7351 (19.3% speedup)
text, 1/3 special: 10397 (0.9% regression)
csv, no special: 7272 (34.7% speedup)
csv, 1/3 special: 13472 (0.7% regression)

v4.2

text, no special: 7300 (19.6% speedup)
text, 1/3 special: 10537 (2.3% regression)
csv, no special: 7260 (34.8% speedup)
csv, 1/3 special: 13881 (3.8% regression)

COPY TO

master: (852558b9)

text, no special: 2446
text, 1/3 special: 6988
csv, no special: 2822
csv, 1/3 special: 6967

v4 (copy to)

text, no special: 1533 (37.3% speedup)
text, 1/3 special: 5949 (14.8% speedup)
csv, no special: 1560 (44.7% speedup)
csv, 1/3 special: 6006 (13.8% speedup)

I find these results particularly exciting because with the COPY FROM v3
patch, the worst-case scenarios are just under 1% regression. The v4 COPY
TO patch is a win across the board.
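
For anyone who wants to re-derive the percentages, here is a small Python
sketch using the Raspberry Pi 5 COPY FROM numbers above. It assumes the
figures are timings where lower is better; the helper name pct_change and
the dictionary layout are just for illustration.

```python
def pct_change(master: float, patched: float) -> float:
    """Percent change relative to master; positive = speedup, negative = regression."""
    return (master - patched) / master * 100.0


# Raspberry Pi 5, COPY FROM, as posted above (assumed: lower is better).
master = {"text, no special": 9111, "text, 1/3 special": 10302,
          "csv, no special": 11147, "csv, 1/3 special": 13375}
v3 = {"text, no special": 7351, "text, 1/3 special": 10397,
      "csv, no special": 7272, "csv, 1/3 special": 13472}

for case, base in master.items():
    change = pct_change(base, v3[case])
    label = "speedup" if change >= 0 else "regression"
    print(f"{case}: {abs(change):.2f}% {label}")
```

Small differences in the last digit compared with the tables are possible,
depending on whether the value is rounded or truncated.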

Note that I ran these benchmarks with everything on a RAM disk and following
the cpupower instructions that Nazir suggested.

So on Arm, the v3 COPY FROM patch is almost all upside, and the v4 COPY TO
patch is all upside. The same is almost true for Intel, but the CSV COPY
FROM regression, even with the v3 COPY FROM patch, is about 5%. The v4.2
COPY FROM patch performs worse than the v3 COPY FROM patch in nearly every
worst-case scenario (the one exception so far is the text case on the
t4g.2xlarge instance).
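
The v3-versus-v4.2 comparison can be checked mechanically against the
"1/3 special" COPY FROM rows posted in this thread. This is only a sketch:
the machine labels and the tuple layout are made up for illustration, and
the numbers are copied from the tables (assumed: lower is better).

```python
# (machine, format) -> (v3, v4.2) for the "1/3 special" COPY FROM rows.
worst_case = {
    ("laptop", "text"): (15748, 18900),
    ("laptop", "csv"): (19934, 26134),
    ("t2.2xlarge", "text"): (22816, 25051),
    ("t2.2xlarge", "csv"): (29338, 32092),
    ("t4g.2xlarge", "text"): (27606, 25773),
    ("t4g.2xlarge", "csv"): (32597, 34055),
    ("Raspberry Pi 5", "text"): (10397, 10537),
    ("Raspberry Pi 5", "csv"): (13472, 13881),
}

# Cases where v4.2 was not slower than v3.
v42_not_worse = [key for key, (v3, v42) in worst_case.items() if v42 <= v3]
print(v42_not_worse)
```

With these numbers the only entry listed is the t4g.2xlarge text case, which
was already flagged as atypical when it was posted.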

Does it seem reasonable to stop performance testing the v4.2 COPY FROM
patch? Have we collected enough benchmark data to be confident that the v3
COPY FROM patch is the one we should be moving forward with?

Hope you are all having a great day,

-Manni
-- 
-- Manni Wood EDB: https://www.enterprisedb.com
