https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494
--- Comment #6 from Peter Cordes <peter at cordes dot ca> --- Oops, these were SD not SS; getting sleepy >.<. Still, my optimization suggestion of doing both compares plus one masked SUB of +-PBCx applies equally, and I think my testing with VBLENDVPS should carry over to VBLENDVPD.

Since this is `double`, if we're going branchless we should definitely vectorize over a pair of doubles, e.g. computing xij = X0(1,i) - X0(1,j) and yij = X0(2,i) - X0(2,j) together with a vmovupd load and a vector of {PBCx, PBCy}. Even if we later need x and y separately (if those FMAs in the asm are multiplying components of one vector), we may still come out ahead by doing the expensive input processing with PD: then it takes only one `vunpckhpd` to get the Y element ready, and that can run in parallel with any x * z work.

Or, if we can unroll by 3 SIMD vectors over contiguous memory, we get {X0,Y0} {Z0,X1} {Y1,Z1}: twice the work, doing 2 i and j values at once, for a cost of only 3 extra unpacks.

----

If this were 3 floats, using a SIMD load would be tricky (maybe vmaskmovps if we need to avoid reading past the end of the array), unless we again unroll by 3 = LCM(vec_len, width) vectors.
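A minimal sketch of the branchless wrap described above, written with SSE2 intrinsics for a {x, y} pair of doubles in one XMM register: both range compares, then a single +-box correction applied through the compare masks. The function name `pbc_wrap_pair` and the `box` parameter are illustrative, not names from the bug; a real build would use the 256-bit AVX equivalents for the unroll-by-3 layout.

```c
#include <immintrin.h>

/* Wrap two difference components (e.g. xij and yij) into the
 * minimum image, branchlessly.  box holds {PBCx, PBCy}. */
static __m128d pbc_wrap_pair(__m128d d, __m128d box)
{
    __m128d half   = _mm_mul_pd(box, _mm_set1_pd(0.5));
    __m128d too_hi = _mm_cmpgt_pd(d, half);                /* d >  +box/2 */
    __m128d too_lo = _mm_cmplt_pd(d,
                         _mm_sub_pd(_mm_setzero_pd(), half)); /* d < -box/2 */
    /* masked +-box correction: +box where too_lo, -box where too_hi,
     * 0 where the component is already in range */
    __m128d corr = _mm_sub_pd(_mm_and_pd(too_lo, box),
                              _mm_and_pd(too_hi, box));
    return _mm_add_pd(d, corr);
}
```

The compare results are all-ones/all-zeros masks, so ANDing them with `box` selects the correction without any blend; with AVX-512 the same idea becomes a single masked VSUBPD/VADDPD under a compare-generated `k` mask.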