I did some measurements before and after, and I noticed a few things.
First, the iTLB-load-misses stat drops from 0.25% all the way down to
0.02%. The frontend and backend stall cycles went down from 1.72% => 1.27%
and 13.90% => 10.62% respectively. The L1-icache-load-misses went *up* from
1.74% => 2.77%.
So it looks like performance is generally about the same or a little better
in most metrics, but for some reason the icache hit rate drops.
Performance measurements with partial linking:

         429,882.68 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 K/sec
                  0      cpu-migrations:u          #    0.000 K/sec
            145,986      page-faults:u             #    0.340 K/sec
  1,830,956,683,109      cycles:u                  #    4.259 GHz                        (35.71%)
     31,472,946,642      stalled-cycles-frontend:u #    1.72% frontend cycles idle       (35.71%)
    254,440,746,368      stalled-cycles-backend:u  #   13.90% backend cycles idle        (35.71%)
  4,117,921,862,700      instructions:u            #    2.25  insn per cycle
                                                   #    0.06  stalled cycles per insn    (35.71%)
    773,059,098,367      branches:u                # 1798.303 M/sec                      (35.71%)
      2,775,345,450      branch-misses:u           #    0.36% of all branches            (35.71%)
  2,329,109,097,524      L1-dcache-loads:u         # 5418.011 M/sec                      (35.71%)
     24,907,172,614      L1-dcache-load-misses:u   #    1.07% of all L1-dcache accesses  (35.71%)
    <not supported>      LLC-loads:u
    <not supported>      LLC-load-misses:u
    872,678,362,265      L1-icache-loads:u         # 2030.038 M/sec                      (35.71%)
     15,221,564,231      L1-icache-load-misses:u   #    1.74% of all L1-icache accesses  (35.71%)
     48,763,102,717      dTLB-loads:u              #  113.434 M/sec                      (35.71%)
         75,459,133      dTLB-load-misses:u        #    0.15% of all dTLB cache accesses (35.71%)
      8,416,573,693      iTLB-loads:u              #   19.579 M/sec                      (35.72%)
         20,650,906      iTLB-load-misses:u        #    0.25% of all iTLB cache accesses (35.72%)

      429.911532621 seconds time elapsed

      428.611864000 seconds user
        0.199257000 seconds sys
Performance measurements without partial linking:

         444,598.61 msec task-clock:u              #    1.000 CPUs utilized
                  0      context-switches:u        #    0.000 K/sec
                  0      cpu-migrations:u          #    0.000 K/sec
            145,528      page-faults:u             #    0.327 K/sec
  1,907,560,568,869      cycles:u                  #    4.291 GHz                        (35.71%)
     24,156,412,003      stalled-cycles-frontend:u #    1.27% frontend cycles idle       (35.72%)
    202,601,144,555      stalled-cycles-backend:u  #   10.62% backend cycles idle        (35.72%)
  4,118,200,832,359      instructions:u            #    2.16  insn per cycle
                                                   #    0.05  stalled cycles per insn    (35.72%)
    773,117,144,029      branches:u                # 1738.910 M/sec                      (35.72%)
      2,727,637,567      branch-misses:u           #    0.35% of all branches            (35.71%)
  2,326,960,449,159      L1-dcache-loads:u         # 5233.845 M/sec                      (35.71%)
     26,778,818,764      L1-dcache-load-misses:u   #    1.15% of all L1-dcache accesses  (35.71%)
    <not supported>      LLC-loads:u
    <not supported>      LLC-load-misses:u
    903,186,314,629      L1-icache-loads:u         # 2031.465 M/sec                      (35.71%)
     25,017,115,665      L1-icache-load-misses:u   #    2.77% of all L1-icache accesses  (35.71%)
     50,448,039,415      dTLB-loads:u              #  113.469 M/sec                      (35.71%)
         78,186,127      dTLB-load-misses:u        #    0.15% of all dTLB cache accesses (35.71%)
      9,419,644,114      iTLB-loads:u              #   21.187 M/sec                      (35.71%)
          1,479,281      iTLB-load-misses:u        #    0.02% of all iTLB cache accesses (35.71%)

      444.623341115 seconds time elapsed

      443.313786000 seconds user
        0.256109000 seconds sys
On Sat, Feb 6, 2021 at 5:20 AM Gabe Black <[email protected]> wrote:
> Out of curiosity I tried a quick x86 boot test, and saw that the run time
> with partial linking removed increased from just under 7 minutes to about
> 7 and a half minutes.
>
> I thought about this for a while, since at first I had no idea why that
> might happen. The theory I came up with is that with partial linking,
> related bits of the simulator are grouped together since they're generally
> in the same directory, and those then likely end up in the same part of
> the final binary. If those things are related, you'll get better locality
> as far as TLB performance and maybe paging things in. gem5 is such a big
> executable that I doubt locality at that scale would make much of a
> difference at the granularity of cache lines. Also, the relocations
> between those entities could possibly be more efficient if the offset
> they need to encode is smaller?
>
> If that's true, there are two ways I've thought of where we could get that
> sort of behavior back without reintroducing partial linking, both of which
> use attributes gcc provides which I assume clang would too.
>
> 1. The "hot" and "cold" attributes. "hot" makes a function get optimized
> particularly aggressively for performance, and "cold" makes the compiler
> optimize for size. According to the docs, both could (probably do?) put the
> items in question into separate sections where they would have better
> locality, and the "cold" functions would stay out of the way.
>
> 2. Put things in different sections explicitly with the "section"
> attribute. This could explicitly group items we'd want to show up near
> each other, like partial linking does implicitly/accidentally.
>
> A third option might be to use profile-guided optimization. I don't know
> exactly how to get gcc or clang to use that or what it requires, but I
> think they at least *can* do something along those lines. That would
> hopefully give the compiler enough information to figure some of these
> things out on its own.
>
> The problem with this option might be that things we don't exercise in the
> profiling (devices or CPUs or features that aren't used) may look
> unimportant, but would be very important if the configuration of the
> simulator was different.
>
> One other thing we might want to try, and I'm not sure how this would
> work, might be to get gem5 loaded in with a larger page size somehow. Given
> how big the binary is, reducing pressure on the TLB that way would probably
> make a fairly big difference in performance.
>
> Gabe
>
_______________________________________________
gem5-dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]