Yeah, that wouldn't hurt :-). The size difference of the .text section, which I assume is the most relevant for icache misses, is only a 27712-byte *decrease*, from 0x16ab445 to 0x16a4805 bytes, so the size without partial linking actually went down slightly. I think what may actually be happening is that the layout of the text section is different in memory, and hot items are aliasing in the cache and kicking each other out.
As far as the pybind11 fix, I don't think that will actually change the final binary size very much; the savings it gets us will be primarily during the actual build. The way it is now, it's as if every file that includes pybind11 directly or indirectly (which is a lot of them) links in its own copy of the pybind11 "library". That means we have hundreds of copies of the common pybind11 code being compiled into and left in .o files across the build directory. Then when the final link happens, the linker not only has to wade through all of that and load it off disk, it also has to find and consolidate all those extra copies so that we get back down to exactly one instance of the common pybind11 machinery. With Ciro's change, the common stuff (the stuff that doesn't need to be templates) will be separated out and compiled exactly once, have exactly one copy in the build directory, and be linked into the final binary exactly once, with no extra copies to load, identify and purge. The build time and space savings should be substantial (and are, according to Ciro's earlier measurements), but the actual final binary should end up being *roughly* the same size. For reference, at the end of the scons fixup branch, the size of just the build/X86/python/_m5 directory is 1.6GB (the worst offender), while the final binary itself is only 222MB.

Gabe

On Wed, Feb 10, 2021 at 7:41 AM Jason Lowe-Power <[email protected]> wrote:

> This is interesting! My takeaway: we need bigger i-caches to run gem5 ;).
>
> Is the binary about the same size for the two conditions? Could it be
> simply that the instruction working set is bigger when not using partial
> linking?
>
> That said, a 3% performance difference isn't a big problem, in my opinion.
> This analysis also gives us some interesting directions for future
> optimizations. I wonder how much the library version of pybind will help
> since that will significantly reduce the instruction footprint.
>
> Jason
>
> On Tue, Feb 9, 2021 at 10:56 PM Gabe Black via gem5-dev <[email protected]>
> wrote:
>
>> I did some measurements before and after, and I noticed a few things.
>> First, the iTLB-load-misses stat drops from 0.25% all the way down to
>> 0.02%. The frontend and backend stall cycles went down from 1.72% => 1.27%
>> and 13.90% => 10.62% respectively. The L1-icache-load-misses went *up*
>> from 1.74% => 2.77%.
>>
>> So it looks like performance is generally about the same or a little
>> better in most metrics, but for some reason the icache hit rate drops.
>>
>> Performance measurements with partial linking:
>>
>>        429,882.68 msec task-clock:u              #    1.000 CPUs utilized
>>                 0      context-switches:u        #    0.000 K/sec
>>                 0      cpu-migrations:u          #    0.000 K/sec
>>           145,986      page-faults:u             #    0.340 K/sec
>> 1,830,956,683,109      cycles:u                  #    4.259 GHz                      (35.71%)
>>    31,472,946,642      stalled-cycles-frontend:u #    1.72% frontend cycles idle     (35.71%)
>>   254,440,746,368      stalled-cycles-backend:u  #   13.90% backend cycles idle      (35.71%)
>> 4,117,921,862,700      instructions:u            #    2.25 insn per cycle
>>                                                  #    0.06 stalled cycles per insn   (35.71%)
>>   773,059,098,367      branches:u                # 1798.303 M/sec                    (35.71%)
>>     2,775,345,450      branch-misses:u           #    0.36% of all branches          (35.71%)
>> 2,329,109,097,524      L1-dcache-loads:u         # 5418.011 M/sec                    (35.71%)
>>    24,907,172,614      L1-dcache-load-misses:u   #    1.07% of all L1-dcache accesses (35.71%)
>>   <not supported>      LLC-loads:u
>>   <not supported>      LLC-load-misses:u
>>   872,678,362,265      L1-icache-loads:u         # 2030.038 M/sec                    (35.71%)
>>    15,221,564,231      L1-icache-load-misses:u   #    1.74% of all L1-icache accesses (35.71%)
>>    48,763,102,717      dTLB-loads:u              #  113.434 M/sec                    (35.71%)
>>        75,459,133      dTLB-load-misses:u        #    0.15% of all dTLB cache accesses (35.71%)
>>     8,416,573,693      iTLB-loads:u              #   19.579 M/sec                    (35.72%)
>>        20,650,906      iTLB-load-misses:u        #    0.25% of all iTLB cache accesses (35.72%)
>>
>>     429.911532621 seconds time elapsed
>>
>>     428.611864000 seconds user
>>       0.199257000 seconds sys
>>
>>
>> Performance measurements without partial linking:
>>
>>        444,598.61 msec task-clock:u              #    1.000 CPUs utilized
>>                 0      context-switches:u        #    0.000 K/sec
>>                 0      cpu-migrations:u          #    0.000 K/sec
>>           145,528      page-faults:u             #    0.327 K/sec
>> 1,907,560,568,869      cycles:u                  #    4.291 GHz                      (35.71%)
>>    24,156,412,003      stalled-cycles-frontend:u #    1.27% frontend cycles idle     (35.72%)
>>   202,601,144,555      stalled-cycles-backend:u  #   10.62% backend cycles idle      (35.72%)
>> 4,118,200,832,359      instructions:u            #    2.16 insn per cycle
>>                                                  #    0.05 stalled cycles per insn   (35.72%)
>>   773,117,144,029      branches:u                # 1738.910 M/sec                    (35.72%)
>>     2,727,637,567      branch-misses:u           #    0.35% of all branches          (35.71%)
>> 2,326,960,449,159      L1-dcache-loads:u         # 5233.845 M/sec                    (35.71%)
>>    26,778,818,764      L1-dcache-load-misses:u   #    1.15% of all L1-dcache accesses (35.71%)
>>   <not supported>      LLC-loads:u
>>   <not supported>      LLC-load-misses:u
>>   903,186,314,629      L1-icache-loads:u         # 2031.465 M/sec                    (35.71%)
>>    25,017,115,665      L1-icache-load-misses:u   #    2.77% of all L1-icache accesses (35.71%)
>>    50,448,039,415      dTLB-loads:u              #  113.469 M/sec                    (35.71%)
>>        78,186,127      dTLB-load-misses:u        #    0.15% of all dTLB cache accesses (35.71%)
>>     9,419,644,114      iTLB-loads:u              #   21.187 M/sec                    (35.71%)
>>         1,479,281      iTLB-load-misses:u        #    0.02% of all iTLB cache accesses (35.71%)
>>
>>     444.623341115 seconds time elapsed
>>
>>     443.313786000 seconds user
>>       0.256109000 seconds sys
>>
>> On Sat, Feb 6, 2021 at 5:20 AM Gabe Black <[email protected]> wrote:
>>
>>> Out of curiosity I tried a quick x86 boot test, and saw that the run
>>> time with partial linking removed increased from just under 7 minutes to
>>> about 7 and a half minutes.
>>>
>>> I thought about this for a while, since at first I had no idea why that
>>> might happen, and a theory I came up with was that when partial linking,
>>> related bits of the simulator are grouped together, since they're generally
>>> in the same directory, and then those will likely end up in the same part
>>> of the final binary. If those things are related, then you'll get better
>>> locality as far as TLB performance and maybe paging things in. gem5 is such
>>> a big executable that I doubt locality at that scale would make much of a
>>> difference at the granularity of cache lines. Also, possibly the type of
>>> relocations between those entities could be more efficient if the offset
>>> they need to encode is smaller?
>>>
>>> If that's true, there are two ways I've thought of where we could get
>>> that sort of behavior back without reintroducing partial linking, both of
>>> which use attributes gcc provides, which I assume clang does too.
>>>
>>> 1. The "hot" and "cold" attributes. "hot" makes a function get optimized
>>> particularly aggressively for performance, and "cold" makes the compiler
>>> optimize it for size. According to the docs, both could (probably do?) put
>>> the items in question into separate sections where they would have better
>>> locality, and the "cold" functions would stay out of the way.
>>>
>>> 2. Put things in different sections explicitly with the "section"
>>> attribute. This could explicitly group items we'd want to show up near
>>> each other, like what partial linking does implicitly/accidentally.
>>>
>>> A third option might be to use profiling-based optimization. I don't
>>> know how to get gcc or clang to use that or what it requires, but I think
>>> they at least *can* do something along those lines. That would hopefully
>>> give the compiler enough information that it could figure some of these
>>> things out on its own.
>>>
>>> The problem with this option might be that things we don't exercise in
>>> the profiling (devices or CPUs or features that aren't used) may look
>>> unimportant, but would be very important if the configuration of the
>>> simulator were different.
>>>
>>> One other thing we might want to try, and I'm not sure how this would
>>> work, might be to get gem5 loaded in with a larger page size somehow.
>>> Given how big the binary is, reducing pressure on the TLB that way would
>>> probably make a fairly big difference in performance.
>>>
>>> Gabe
>>>
>> _______________________________________________
>> gem5-dev mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
