https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811
--- Comment #15 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
With SRA improvements r:aae723d360ca26cd9fd0b039fb0a616bd0eae363 we finally get
good performance at -O2. Improvements to the push_back implementation also help
a bit.

Mainline with default flags (-O2):
    Input: JPEG - Quality: 90:
    19.76
    19.75
    19.68

Mainline with -O2 -march=native:
    Input: JPEG - Quality: 90:
    20.01
    20
    19.98

Mainline with -O2 -march=native -flto
    Input: JPEG - Quality: 90:
    19.95
    19.98
    19.81

Mainline with -O2 -march=native -flto --param max-inline-insns-auto=80 (this
makes push_back inlined)
    Input: JPEG - Quality: 90:
    19.98
    20.05
    20.03

Mainline with -O2 -flto -march=native -I/usr/include/c++/v1 -nostdinc++ -lc++
(so clang's libc++)
    21.38
    21.37
    21.32

Mainline with -O2 -flto -march=native run manually since build machinery patch
is needed
    23.03
    22.85
    23.04

Clang 17 with -O2 -march=native -flto and also -fno-tree-vectorize
-fno-tree-slp-vectorize added by cmake. This is with system libstdc++ from
GCC13 so before push_back improvements.
    21.16
    20.95
    21.06

Clang 17 with -O2 -march=native -flto and also -fno-tree-vectorize
-fno-tree-slp-vectorize added by cmake. This is with trunk libstdc++ with
push_back improvements.
    21.2
    20.93
    20.98

Clang 17 with -O2 -march=native -flto -stdlib=libc++ and also
-fno-tree-vectorize -fno-tree-slp-vectorize added by cmake. This is with
clang's libc++
    Input: JPEG - Quality: 90:
    22.08
    21.88
    21.78

Clang 17 with -O3 -march=native -flto
    23.08
    22.90
    22.84

libc++ declares push_back always_inline and splits out the slow copying path.
I think the inlined part is still a bit too large for inlining at -O2.

We could still try to get the remaining approx. 10% without increasing code
size at -O2. However, the major part of the problem is solved.
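
For illustration, the push_back split described for libc++ (a forced-inline
fast path plus an out-of-line slow copying path) could look roughly like the
sketch below. This is not the actual libc++ or libstdc++ code: the class
IntVec, the helper grow_and_push, the int-only storage and the doubling growth
policy are all hypothetical and chosen only to keep the example short. The
attributes used are the GCC/Clang always_inline and noinline function
attributes.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical minimal vector of int, for illustration only.
// It is NOT the real libc++ or libstdc++ implementation.
class IntVec {
    int *begin_ = nullptr;  // start of storage
    int *end_ = nullptr;    // one past the last stored element
    int *cap_ = nullptr;    // one past the end of allocated storage

    // Slow path: reallocate, copy the old elements, then append.
    // Kept out of line so the inlined fast path stays small.
    __attribute__((noinline)) void grow_and_push(int value) {
        std::size_t old_size = static_cast<std::size_t>(end_ - begin_);
        std::size_t new_cap = old_size ? old_size * 2 : 4;  // assumed policy
        int *fresh = new int[new_cap];
        if (old_size)
            std::memcpy(fresh, begin_, old_size * sizeof(int));
        delete[] begin_;
        begin_ = fresh;
        end_ = fresh + old_size;
        cap_ = fresh + new_cap;
        *end_++ = value;
    }

public:
    IntVec() = default;
    IntVec(const IntVec &) = delete;
    IntVec &operator=(const IntVec &) = delete;
    ~IntVec() { delete[] begin_; }

    // Fast path: one compare and one store, forced inline at every call site.
    __attribute__((always_inline)) inline void push_back(int value) {
        if (end_ != cap_)
            *end_++ = value;
        else
            grow_and_push(value);
    }

    std::size_t size() const {
        return static_cast<std::size_t>(end_ - begin_);
    }
    int operator[](std::size_t i) const { return begin_[i]; }
};
```

With this shape, each call site pays only for the capacity check and the
store, while the reallocation code exists once. Whether the inlined part is
small enough to be profitable at -O2 is exactly the trade-off discussed above.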