Re: Speed up COPY FROM text/CSV parsing using SIMD

KAZAR Ayoub Fri, 06 Feb 2026 14:36:44 -0800

Hello,

On Fri, Feb 6, 2026 at 11:19 PM Nazir Bilal Yavuz <[email protected]>
wrote:


> Hi,
>
> Thank you for sharing your thoughts!
>
> On Sat, 7 Feb 2026 at 00:29, Nathan Bossart <[email protected]>
> wrote:
> >
> > It looks like a lot of energy has been put into benchmarking and refining
> > the heuristic for deciding when to use the SIMD path so that we avoid large
> > regressions when there are special characters.  I think this is all
> > valuable work, but I'm a bit concerned that we are putting the cart before
> > the horse.  IMHO it would be better to first get the SIMD code committed
> > with the absolute simplest heuristic we can think of (e.g., as soon as we
> > see a special character, switch to the scalar path for the remainder of
> > COPY FROM).  My hope is that would be far easier to reason about from a
> > performance angle.  If we immediately fall back to the existing code path,
> > we don't need to worry about how many special characters there are and
> > whether they are sparse or clustered or whatever.  We just need to measure
> > the overhead of the new branches and ensure they don't produce meaningful
> > regressions.  Assuming that all looks good, we can then focus on the SIMD
> > code itself and make sure that is correct and optimal.  And once we get
> > that portion committed, we could then consider more sophisticated
> > heuristics.
>
I agree with this too, especially regarding the line_buf refilling idea:
finding a good threshold value for it needs more time than the work on the
heuristic.

>
> I have three possible approaches in my mind, they are actually similar
> to each other.
>
> 1- After encountering a special character, disable SIMD for the rest
> of the current line and also for the rest of the data.
>
> 2- It is a mixed version of the current heuristic and #1. After
> encountering a special character, skip SIMD for the current line (let's
> say line 1) and for the next line (line 2). Then try running SIMD for
> the next line (line 3); if there is no special character, continue to
> run SIMD, but if there is a special character, skip running SIMD
> for two lines this time. And it goes on like that: every time a special
> character is encountered in the SIMD run, the number of skipped lines is
> doubled.
>
> 3- This version is a bit different from #2. Instead of calculating the
> number of lines to skip dynamically, skip the constant N number of
> lines and then try to run SIMD again after these lines. N could be
> something like 100, 1000, or 10000, etc. Actually, you and Andrew
> suggested this approach before [1].
>
> I think what you suggested is closer to #1 or #3. I just wanted to
> hear your opinions, and whether you think any of these approaches are
> good to implement / work on.
>
For v19, #1 seems like wasted potential. #3 sounds more relaxed than
v4.2, so it has good potential. I can fully benchmark it against v3 as
soon as you send a patch for it.
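
To make the comparison between #1, #2 and #3 concrete, here is a rough
sketch of the three policies as a small per-COPY state machine. It is not
taken from any posted patch: every name in it (SimdPolicy, SimdState,
simd_should_try, simd_note_special) is hypothetical, and it assumes the
decision is made once per line. For #2 it also assumes the back-off never
resets after a clean line, which the description above leaves open.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this comes from the patches. */
typedef enum
{
    SIMD_POLICY_DISABLE_FOREVER,    /* approach #1 */
    SIMD_POLICY_DOUBLING_BACKOFF,   /* approach #2 */
    SIMD_POLICY_FIXED_SKIP          /* approach #3 */
} SimdPolicy;

typedef struct
{
    SimdPolicy  policy;
    bool        disabled;           /* #1: set once, never cleared */
    uint64_t    skip_remaining;     /* lines still to parse on the scalar path */
    uint64_t    next_skip;          /* #2: lines to skip on the next hit */
    uint64_t    fixed_skip;         /* #3: the constant N */
} SimdState;

static void
simd_state_init(SimdState *st, SimdPolicy policy, uint64_t fixed_skip)
{
    st->policy = policy;
    st->disabled = false;
    st->skip_remaining = 0;
    st->next_skip = 1;
    st->fixed_skip = fixed_skip;
}

/* Called at the start of each line: should this line try the SIMD path? */
static bool
simd_should_try(SimdState *st)
{
    if (st->disabled)
        return false;
    if (st->skip_remaining > 0)
    {
        st->skip_remaining--;
        return false;
    }
    return true;
}

/*
 * Called when the SIMD path meets a special character.  The caller is
 * assumed to finish the current line on the scalar path.
 */
static void
simd_note_special(SimdState *st)
{
    switch (st->policy)
    {
        case SIMD_POLICY_DISABLE_FOREVER:
            st->disabled = true;
            break;
        case SIMD_POLICY_DOUBLING_BACKOFF:
            st->skip_remaining = st->next_skip;
            if (st->next_skip <= UINT64_MAX / 2)
                st->next_skip *= 2;     /* 1, 2, 4, ... lines */
            break;
        case SIMD_POLICY_FIXED_SKIP:
            st->skip_remaining = st->fixed_skip;
            break;
    }
}
```

With this shape, #1 only costs one extra branch per line once SIMD is
disabled, while #2 and #3 keep a counter so that SIMD gets retried later.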


Regards,
Ayoub
