This is interesting! My takeaway: we need bigger i-caches to run gem5 ;). Is the binary about the same size for the two conditions? Could it be simply that the instruction working set is bigger when not using partial linking?
That said, a 3% performance difference isn't a big problem, in my opinion. This analysis also gives us some interesting directions for future optimizations. I wonder how much the library version of pybind will help, since that will significantly reduce the instruction footprint.

Jason

On Tue, Feb 9, 2021 at 10:56 PM Gabe Black via gem5-dev <gem5-dev@gem5.org> wrote:

> I did some measurements before and after, and I noticed a few things.
> First, the iTLB-load-misses stat drops from 0.25% all the way down to
> 0.02%. The frontend and backend stall cycles went down from 1.72% => 1.27%
> and 13.90% => 10.62% respectively. The L1-icache-load-misses went *up* from
> 1.74% => 2.77%.
>
> So it looks like performance is generally about the same or a little
> better in most metrics, but for some reason the icache hit rate drops.
>
> Performance measurements with partial linking:
>
>          429,882.68 msec task-clock:u               # 1.000 CPUs utilized
>                   0      context-switches:u         # 0.000 K/sec
>                   0      cpu-migrations:u           # 0.000 K/sec
>             145,986      page-faults:u              # 0.340 K/sec
>   1,830,956,683,109      cycles:u                   # 4.259 GHz  (35.71%)
>      31,472,946,642      stalled-cycles-frontend:u  # 1.72% frontend cycles idle  (35.71%)
>     254,440,746,368      stalled-cycles-backend:u   # 13.90% backend cycles idle  (35.71%)
>   4,117,921,862,700      instructions:u             # 2.25 insn per cycle
>                                                     # 0.06 stalled cycles per insn  (35.71%)
>     773,059,098,367      branches:u                 # 1798.303 M/sec  (35.71%)
>       2,775,345,450      branch-misses:u            # 0.36% of all branches  (35.71%)
>   2,329,109,097,524      L1-dcache-loads:u          # 5418.011 M/sec  (35.71%)
>      24,907,172,614      L1-dcache-load-misses:u    # 1.07% of all L1-dcache accesses  (35.71%)
>     <not supported>      LLC-loads:u
>     <not supported>      LLC-load-misses:u
>     872,678,362,265      L1-icache-loads:u          # 2030.038 M/sec  (35.71%)
>      15,221,564,231      L1-icache-load-misses:u    # 1.74% of all L1-icache accesses  (35.71%)
>      48,763,102,717      dTLB-loads:u               # 113.434 M/sec  (35.71%)
>          75,459,133      dTLB-load-misses:u         # 0.15% of all dTLB cache accesses  (35.71%)
>       8,416,573,693      iTLB-loads:u               # 19.579 M/sec  (35.72%)
>          20,650,906      iTLB-load-misses:u         # 0.25% of all iTLB cache accesses  (35.72%)
>
>       429.911532621 seconds time elapsed
>
>       428.611864000 seconds user
>         0.199257000 seconds sys
>
>
> Performance measurements without partial linking:
>
>          444,598.61 msec task-clock:u               # 1.000 CPUs utilized
>                   0      context-switches:u         # 0.000 K/sec
>                   0      cpu-migrations:u           # 0.000 K/sec
>             145,528      page-faults:u              # 0.327 K/sec
>   1,907,560,568,869      cycles:u                   # 4.291 GHz  (35.71%)
>      24,156,412,003      stalled-cycles-frontend:u  # 1.27% frontend cycles idle  (35.72%)
>     202,601,144,555      stalled-cycles-backend:u   # 10.62% backend cycles idle  (35.72%)
>   4,118,200,832,359      instructions:u             # 2.16 insn per cycle
>                                                     # 0.05 stalled cycles per insn  (35.72%)
>     773,117,144,029      branches:u                 # 1738.910 M/sec  (35.72%)
>       2,727,637,567      branch-misses:u            # 0.35% of all branches  (35.71%)
>   2,326,960,449,159      L1-dcache-loads:u          # 5233.845 M/sec  (35.71%)
>      26,778,818,764      L1-dcache-load-misses:u    # 1.15% of all L1-dcache accesses  (35.71%)
>     <not supported>      LLC-loads:u
>     <not supported>      LLC-load-misses:u
>     903,186,314,629      L1-icache-loads:u          # 2031.465 M/sec  (35.71%)
>      25,017,115,665      L1-icache-load-misses:u    # 2.77% of all L1-icache accesses  (35.71%)
>      50,448,039,415      dTLB-loads:u               # 113.469 M/sec  (35.71%)
>          78,186,127      dTLB-load-misses:u         # 0.15% of all dTLB cache accesses  (35.71%)
>       9,419,644,114      iTLB-loads:u               # 21.187 M/sec  (35.71%)
>           1,479,281      iTLB-load-misses:u         # 0.02% of all iTLB cache accesses  (35.71%)
>
>       444.623341115 seconds time elapsed
>
>       443.313786000 seconds user
>         0.256109000 seconds sys
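For anyone wanting to reproduce counters like the ones above: this is perf stat output, and something along the lines of the following should collect the same set of events. The binary and config paths are placeholders, not the exact command that was used:

    perf stat -d -d ./build/X86/gem5.opt configs/example/fs.py <args>

The repeated -d flags are what add the L1-icache, dTLB, and iTLB rows to perf's default counter set.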
> On Sat, Feb 6, 2021 at 5:20 AM Gabe Black <gabe.bl...@gmail.com> wrote:
>
>> Out of curiosity I tried a quick x86 boot test, and saw that the run time
>> with partial linking removed increased from just under 7 minutes to about
>> 7 and a half minutes.
>>
>> I thought about this for a while, since at first I had no idea why that
>> might happen. A theory I came up with is that with partial linking,
>> related bits of the simulator are grouped together, since they're
>> generally in the same directory, and those groups then likely end up in
>> the same part of the final binary. If those things are related, then
>> you'll get better locality as far as TLB performance and maybe paging
>> things in. gem5 is such a big executable that I doubt locality at that
>> scale would make much of a difference at the granularity of cache lines.
>> Also, possibly the relocations between those entities could be more
>> efficient if the offsets they need to encode are smaller?
>>
>> If that's true, there are two ways I've thought of to get that sort of
>> behavior back without reintroducing partial linking, both of which use
>> attributes gcc provides and which I assume clang does too.
>>
>> 1. The "hot" and "cold" attributes. "hot" makes a function get optimized
>> particularly aggressively for performance, and "cold" makes the compiler
>> optimize it for size. According to the docs, both could (probably do?)
>> put the items in question into separate sections where they would have
>> better locality, and the "cold" functions would stay out of the way.
>>
>> 2. Put things in different sections explicitly with the "section"
>> attribute. This could explicitly group items we'd want to show up near
>> each other, like what partial linking does implicitly/accidentally.
>>
>> A third option might be to use profile-guided optimization. I don't know
>> how to get gcc or clang to use that or what it requires, but I think they
>> at least *can* do something along those lines. That would hopefully give
>> the compiler enough information to figure some of these things out on its
>> own.
>>
>> The problem with this option might be that things we don't exercise in
>> the profiling (devices or CPUs or features that aren't used) may look
>> unimportant, but would be very important if the configuration of the
>> simulator were different.
>>
>> One other thing we might want to try, though I'm not sure how it would
>> work, would be to get gem5 loaded in with a larger page size somehow.
>> Given how big the binary is, reducing pressure on the TLB that way would
>> probably make a fairly big difference in performance.
>>
>> Gabe
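To make options 1 and 2 above a bit more concrete, here is a minimal sketch of the attribute syntax. The function names, bodies, and section name are made up purely for illustration; only the attributes themselves are the gcc ones being discussed, and clang accepts the same spellings:

    #include <cstdio>

    // Option 1: "hot" and "cold" attributes. The compiler optimizes hot
    // functions more aggressively and groups them in its hot text
    // subsection; cold functions are optimized for size and grouped in an
    // "unlikely" subsection, out of the way of the hot path.
    __attribute__((hot)) static void simulateTick()
    {
        // ... per-tick work would go here ...
    }

    __attribute__((cold)) static void reportFatalError(const char *msg)
    {
        std::fprintf(stderr, "fatal: %s\n", msg);
    }

    // Option 2: the "section" attribute places a function in a named
    // section, so related code can be grouped explicitly, roughly what
    // partial linking did as a side effect. The section name is arbitrary.
    __attribute__((section(".text.gem5_hot"))) static void deviceAccess()
    {
        // ... frequently executed device model code ...
    }

    int main()
    {
        simulateTick();
        deviceAccess();
        if (false)  // never taken; just keeps the cold path referenced
            reportFatalError("unreachable in this sketch");
        return 0;
    }

With the hot/cold attributes the grouping is left to the compiler's own conventions, while the section attribute gives full control over which things land next to each other.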
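For the profile-guided optimization option, the generic gcc flow (clang supports the same flag spellings) is built around -fprofile-generate and -fprofile-use; how those flags would be plumbed through gem5's scons build is a separate question. Roughly, with illustrative paths:

    g++ -O3 -fprofile-generate ... -o gem5.instrumented   # build an instrumented binary
    ./gem5.instrumented <representative config>           # run it to write profile data
    g++ -O3 -fprofile-use ... -o gem5                     # rebuild using that profile

Among other things this lets the compiler do the hot/cold grouping from option 1 automatically, which is also where Gabe's caveat bites: code the profiling run never touches looks cold to the compiler, even if some other configuration leans on it heavily.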
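On the larger page size idea: if I remember right, libhugetlbfs ships a hugectl wrapper with a --text option that remaps a program's text segment onto huge pages at load time, which could be a low-effort way to check whether iTLB pressure is really the limiter before investing in anything more invasive.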