And Jiahao Chen writes:
> I tried to manually inline idxmaxabs. It made absolutely no difference
> on my machine. The row scaling takes ~0.05% of total execution time.

Simply inlining, sure, but you could scale inside the outer loop and find the next pivot in the inner loop. Making only a single pass over the data should save more than 0.05% once you leave cache. But as long as you're in cache (500x500 doubles is approx. 2 MiB), not much will matter.

Ultimately, I'm not sure who's interested in complete pivoting for LU. That choice alone kills performance on modern machines for negligible benefit. You likely would find more interest for column-pivoted QR or rook pivoting in LDL^T.
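
To make the single-pass suggestion above concrete, here is a rough sketch. It is plain Python rather than the code under discussion, the name lu_complete_pivoting is made up for illustration, and it does not reproduce idxmaxabs. The multiplier is computed (the "scaling") in the outer loop over rows, and the inner loop both updates the trailing block and tracks the largest entry, so no separate pivot search pass is needed after the first one.

```python
def lu_complete_pivoting(a):
    """In-place LU with complete pivoting on a square list-of-lists matrix.

    Hypothetical illustration only. On return, the strict lower triangle of
    `a` holds the multipliers (L has an implicit unit diagonal) and the upper
    triangle holds U. Returns (rows, cols) such that
    original[rows[i]][cols[j]] == (L @ U)[i][j] up to rounding.
    """
    n = len(a)
    rows = list(range(n))
    cols = list(range(n))

    # Initial pivot search: the only standalone pass over the data.
    pi, pj, best = 0, 0, -1.0
    for i in range(n):
        for j in range(n):
            if abs(a[i][j]) > best:
                pi, pj, best = i, j, abs(a[i][j])

    for k in range(n):
        if best == 0.0:
            break  # trailing block is exactly zero; nothing left to eliminate

        # Move the pivot found during the previous step to position (k, k).
        a[k], a[pi] = a[pi], a[k]
        rows[k], rows[pi] = rows[pi], rows[k]
        for row in a:
            row[k], row[pj] = row[pj], row[k]
        cols[k], cols[pj] = cols[pj], cols[k]

        pivot = a[k][k]
        pi, pj, best = k + 1, k + 1, -1.0
        for i in range(k + 1, n):
            a[i][k] /= pivot              # scale in the outer loop
            m = a[i][k]
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]    # update the trailing block ...
                if abs(a[i][j]) > best:   # ... and find the next pivot
                    pi, pj, best = i, j, abs(a[i][j])
    return rows, cols
```

Each trailing entry is touched once per elimination step instead of twice, which is the part that should matter once the matrix no longer fits in cache. An interpreted loop like this says nothing about real timings; it only shows the loop structure.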
