Hello,

On Fri, Feb 6, 2026 at 11:19 PM Nazir Bilal Yavuz <[email protected]> wrote:
> Hi,
>
> Thank you for sharing your thoughts!
>
> On Sat, 7 Feb 2026 at 00:29, Nathan Bossart <[email protected]> wrote:
> >
> > It looks like a lot of energy has been put into benchmarking and
> > refining the heuristic for deciding when to use the SIMD path so that
> > we avoid large regressions when there are special characters. I think
> > this is all valuable work, but I'm a bit concerned that we are putting
> > the cart before the horse. IMHO it would be better to first get the
> > SIMD code committed with the absolute simplest heuristic we can think
> > of (e.g., as soon as we see a special character, switch to the scalar
> > path for the remainder of COPY FROM). My hope is that would be far
> > easier to reason about from a performance angle. If we immediately
> > fall back to the existing code path, we don't need to worry about how
> > many special characters there are and whether they are sparse or
> > clustered or whatever. We just need to measure the overhead of the new
> > branches and ensure they don't produce meaningful regressions.
> > Assuming that all looks good, we can then focus on the SIMD code
> > itself and make sure that is correct and optimal. And once we get that
> > portion committed, we could then consider more sophisticated
> > heuristics.
>

I agree with this as well, especially for the line_buf refilling idea: finding a good threshold value for it will take a bit more time than the work on the heuristic.

>
> I have three possible approaches in my mind, they are actually similar
> to each other.
>
> 1- After encountering a special character, disable SIMD for the rest
> of the current line and also for the rest of the data.
>
> 2- It is a mixed version of the current heuristic and #1. After
> encountering a special character, skip SIMD for the current line
> (let's say line 1) and for the next line (line 2). Then try running
> SIMD for the next line (line 3), if there is no special character
> continue to run SIMD but if there is a special character then skip
> running SIMD for two lines this time. And it goes like that, every
> time a special character is encountered in the SIMD run, skipped SIMD
> lines are doubled.
>
> 3- This version is a bit different from #2. Instead of calculating the
> number of lines to skip dynamically, skip the constant N number of
> lines and then try to run SIMD again after these lines. N could be
> something like 100, 1000, or 10000, etc. Actually, you and Andrew
> suggested this approach before [1].
>
> I think what you suggested is closer to #1 or #3. I just wanted to
> hear your opinions, and whether you think any of these approaches are
> good to implement / work on.
>

For v19, #1 seems like wasted potential. #3 sounds more relaxed than v4.2, so it has good potential; I can fully benchmark it against v3 as soon as you send a patch for it.

Regards,
Ayoub
