ritter-x2a wrote:

One thing that's not ideal with this patch is that the lowering for memsets with large statically known sizes wastes registers with SelectionDAG ISel: in this case, the IR lowering uses a `<64 x i32>` store in the main memset loop. We correctly legalize this into 16 dwordx4 stores, but the huge `<64 x i32>` splat value that is stored there (and which consists of 256 copies of the same byte) lives in a different basic block than the stores. SDAG ISel therefore doesn't know that those 64 32-bit registers holding the same value are not all needed at the same time and that 4 would be enough (GlobalISel, since it can look across BBs, doesn't have this problem). The emitted IR has roughly the shape sketched below.
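For illustration, a minimal sketch of that shape (function and value names are invented, and the residual handling after the main loop is omitted): the splat is materialized once in the entry block, while the wide store sits in the loop body.

```llvm
; Minimal sketch, not the exact IR the lowering emits: the <64 x i32> splat
; is defined outside the loop, so SDAG ISel sees 64 live i32 values at the store.
define void @memset_wide_sketch(ptr %dst, i8 %val, i64 %ntrips) {
entry:
  ; build a 4-byte splat of %val, then broadcast it to <64 x i32>
  %v32 = zext i8 %val to i32
  %v32.splat = mul i32 %v32, 16843009            ; 0x01010101
  %ins = insertelement <64 x i32> poison, i32 %v32.splat, i64 0
  %splat = shufflevector <64 x i32> %ins, <64 x i32> poison, <64 x i32> zeroinitializer
  br label %loop

loop:                                            ; assumes %ntrips > 0
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %ptr = getelementptr inbounds <64 x i32>, ptr %dst, i64 %i
  store <64 x i32> %splat, ptr %ptr, align 4
  %i.next = add nuw i64 %i, 1
  %cond = icmp ult i64 %i.next, %ntrips
  br i1 %cond, label %loop, label %exit

exit:
  ret void
}
```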
You can see this for example in `@memset_p0_sz1055_align_4_varsetval` in `memset-param-combinations.ll`.

I tried adjusting the IR lowering to put the splat values in the same basic block as the accesses, but then they are LICM-ed out again. I also tried adjusting the lowering to use N (= 16) `<4 x i32>` stores (with only a single `<4 x i32>` splat), roughly the shape sketched at the end of this message. While that fixed the register wastage, it made code generation worse in a different way: the SCEV-based strength reduction (loop-reduce) then replaces the address computations with new ones that don't use `inbounds` and `nuw`, which means that offsets cannot be folded into store instructions in various cases. This even happens if I change the memset lowering to produce the form that loop-reduce would generate: it still re-generates the address computation minus the poison-generating flags.

The effect of the register wastage is probably not very dramatic in practice, because it only happens for quite large memsets that will take some time anyway, but do let me know if you have suggestions on how to avoid it.

https://github.com/llvm/llvm-project/pull/169040
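For reference, a rough sketch of the `<4 x i32>`-based variant I experimented with (names are made up; shown with 4 stores, i.e. 64 bytes, per iteration instead of 16 stores/256 bytes to keep it short, and the residual after the main loop is omitted). The point is that each store address is a small constant `inbounds nuw` offset from the per-iteration base pointer, which is exactly what loop-reduce rewrites without those flags:

```llvm
; Rough sketch, assuming %loopbytes is a positive multiple of the chunk size;
; the remaining bytes would be handled by residual code after this loop.
define void @memset_narrow_sketch(ptr %dst, i8 %val, i64 %loopbytes) {
entry:
  ; a single <4 x i32> splat of the set value
  %v32 = zext i8 %val to i32
  %v32.splat = mul i32 %v32, 16843009            ; 0x01010101
  %ins = insertelement <4 x i32> poison, i32 %v32.splat, i64 0
  %splat = shufflevector <4 x i32> %ins, <4 x i32> poison, <4 x i32> zeroinitializer
  br label %loop

loop:
  ; per-iteration base pointer; stores at constant inbounds/nuw offsets from it
  %off = phi i64 [ 0, %entry ], [ %off.next, %loop ]
  %p0 = getelementptr inbounds nuw i8, ptr %dst, i64 %off
  store <4 x i32> %splat, ptr %p0, align 4
  %p1 = getelementptr inbounds nuw i8, ptr %p0, i64 16
  store <4 x i32> %splat, ptr %p1, align 4
  %p2 = getelementptr inbounds nuw i8, ptr %p0, i64 32
  store <4 x i32> %splat, ptr %p2, align 4
  %p3 = getelementptr inbounds nuw i8, ptr %p0, i64 48
  store <4 x i32> %splat, ptr %p3, align 4
  %off.next = add nuw i64 %off, 64
  %cond = icmp ult i64 %off.next, %loopbytes
  br i1 %cond, label %loop, label %exit

exit:
  ret void
}
```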
