On Tue, Jan 9, 2024 at 11:20 PM Nathan Bossart <nathandboss...@gmail.com> wrote:
>
> On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> > On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandboss...@gmail.com>
> > wrote:
> >>
> >> > I suspect that there could be a regression lurking for some inputs
> >> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
> >> > able to read 4 vector registers worth of elements before taking the
> >> > fast path. There is then a tail of up to 15 elements that are now
> >> > checked one-by-one, but AVX2 would increase that to 31. That's getting
> >> > big enough to be noticeable, I suspect. It would be good to understand
> >> > that case (n*32 + 31), because it may also be relevant now. It's also
> >> > easy to improve for SSE2/NEON for v17.
> >>
> >> Good idea. If it is indeed noticeable, we might be able to "fix" it by
> >> processing some of the tail with shorter vectors. But that probably means
> >> finding a way to support multiple vector sizes on the same build, which
> >> would require some work.
> >
> > What I had in mind was an overlapping pattern I've seen in various
> > places: do one iteration at the beginning, then subtract the
> > aligned-down length from the end and do all those iterations. And
> > one-by-one is only used if the total length is small.
>
> Sorry, I'm not sure I understood this. Do you mean processing the first
> several elements individually or with SSE2 until the number of remaining
> elements can be processed with just the AVX2 instructions (a bit like how
> pg_comp_crc32c_armv8() is structured for memory alignment)?
If we have, say, 25 elements, I mean (for SSE2) check the first 16, then the last 16. Some elements will be checked twice, but that's okay.
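
Roughly like this (just a hand-wavy sketch of the overlapping idea, not the
real pg_lfind32() -- the function names and the 16-element block size, i.e.
4 SSE2 registers x 4 lanes, are made up for illustration):

#include <stdbool.h>
#include <stdint.h>
#include <emmintrin.h>			/* SSE2 intrinsics */

/* Check one block of 16 uint32 elements starting at "base" for "key". */
static inline bool
lfind32_block16(uint32_t key, const uint32_t *base)
{
	const __m128i keys = _mm_set1_epi32((int) key);
	__m128i		result = _mm_setzero_si128();

	/* 4 registers x 4 lanes = 16 elements */
	for (int i = 0; i < 4; i++)
	{
		__m128i		chunk = _mm_loadu_si128((const __m128i *) (base + i * 4));

		result = _mm_or_si128(result, _mm_cmpeq_epi32(chunk, keys));
	}

	return _mm_movemask_epi8(result) != 0;
}

static bool
lfind32_overlapping(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	uint32_t	tail_start;

	/* only arrays shorter than one block fall back to one-by-one checks */
	if (nelem < 16)
	{
		for (uint32_t i = 0; i < nelem; i++)
		{
			if (base[i] == key)
				return true;
		}
		return false;
	}

	tail_start = nelem - 16;

	/* first block: elements 0..15 */
	if (lfind32_block16(key, base))
		return true;

	/* full blocks in the middle (none for nelem = 25) */
	for (uint32_t i = 16; i < tail_start; i += 16)
	{
		if (lfind32_block16(key, base + i))
			return true;
	}

	/*
	 * Final block anchored at the end of the array; for 25 elements this
	 * covers 9..24, overlapping the first block.  The overlap is harmless
	 * because we only test for the key's presence.
	 */
	return lfind32_block16(key, base + tail_start);
}

So only arrays shorter than one block ever take the one-by-one path;
everything else is handled entirely with full-width vector loads, at the
cost of re-checking a few elements near the end.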