On Mon, Nov 17, 2025, 11:16 PM Nathan Bossart <[email protected]> wrote:
> (assuming there is a desire to
> continue with it)? I'm hoping to start spending more time on it soon.
>

Some things worth noting for future reference (so that nobody else wastes
time on them): I previously tried several extra micro-optimizations inside
and around CopyReadLineText.

SIMD alignment: Forcing 16-byte-aligned buffers so that we could use aligned
load instructions (_mm_load_si128 vs. _mm_loadu_si128) provided no measurable
benefit on modern CPUs (there is definitely a thread somewhere discussing
this that I haven't come across yet). This likely explains why simd.h
exclusively uses unaligned load intrinsics: the performance difference has
been negligible since Nehalem.

Memory prefetching: Explicit prefetch instructions for the COPY buffer
pipeline (copy_raw_buf, input buffers, etc.) showed either no improvement or
a slight regression. Multiple chunks already fit within one cache line, the
other buffers are too far away to prefetch usefully, and the next part of the
buffer is a simple sequential access that gets prefetched anyway. So the
extra uops turn out not to be worth it.

Instruction-level parallelism: Spreading the work across too many
independent vector operations to increase ILP eventually degrades
performance, likely due to backend saturation as observed through perf (most
likely contention for execution ports and execution units?).

.....

This suggests that further optimization work should focus on the pipeline as
a whole if we want large benefits (parallel copy [0], maybe?).

[0] https://www.postgresql.org/message-id/CAA4eK1+kpddvvLxWm4BuG_AhVvYz8mKAEa7osxp_X0d4ZEiV=g...@mail.gmail.com

--
Regards,
Ayoub Kazar
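
For anyone who wants to see the aligned-versus-unaligned point concretely,
here is a minimal sketch of a newline scan that uses only unaligned loads.
It is not the patch and not PostgreSQL code: find_newline is a hypothetical
helper written for illustration, __builtin_ctz assumes GCC or Clang, and the
SSE2 path is compiled only where __SSE2__ is defined, with a scalar loop
otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper, for illustration only: return the offset of the
 * first '\n' in buf[0..len), or len if there is none.
 */
static size_t
find_newline(const char *buf, size_t len)
{
    size_t      i = 0;

    assert(buf != NULL);

#if defined(__SSE2__)
    const __m128i nl = _mm_set1_epi8('\n');

    for (; i + 16 <= len; i += 16)
    {
        /* Unaligned load: valid for any address, so no alignment prologue. */
        __m128i     chunk = _mm_loadu_si128((const __m128i *) (buf + i));
        int         mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));

        /* Each set bit marks a matching byte; the lowest bit is the first. */
        if (mask != 0)
            return i + (size_t) __builtin_ctz((unsigned) mask);
    }
#endif

    /* Scalar tail (or the whole buffer when SSE2 is not available). */
    for (; i < len; i++)
    {
        if (buf[i] == '\n')
            break;
    }
    return i;
}
```

Swapping _mm_loadu_si128 for _mm_load_si128 would additionally require
every buf + i to be 16-byte aligned, which is the constraint that did not
pay for itself in the experiments described above.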
