This is interesting! My takeaway: we need bigger i-caches to run gem5 ;).

Is the binary about the same size for the two conditions? Could it be
simply that the instruction working set is bigger when not using partial
linking?

That said, a 3% performance difference isn't a big problem, in my opinion.
This analysis also gives us some interesting directions for future
optimizations. I wonder how much the library version of pybind will help
since that will significantly reduce the instruction footprint.

Jason

On Tue, Feb 9, 2021 at 10:56 PM Gabe Black via gem5-dev <gem5-dev@gem5.org>
wrote:

> I did some measurements before and after, and I noticed a few things.
> First, the iTLB-load-misses stat drops from 0.25% all the way down to
> 0.02%. The frontend and backend stall cycles went down from 1.72% => 1.27%
> and 13.90% => 10.62% respectively. The L1-icache-load-misses went *up* from
> 1.74% => 2.77%.
>
> So most of the individual metrics look about the same or a little better
> without partial linking, but for some reason the icache hit rate drops,
> and the overall run time is longer.
>
> Performance measurements with partial linking:
>
>          429,882.68 msec task-clock:u              #    1.000 CPUs utilized
>                   0      context-switches:u        #    0.000 K/sec
>                   0      cpu-migrations:u          #    0.000 K/sec
>             145,986      page-faults:u             #    0.340 K/sec
>   1,830,956,683,109      cycles:u                  #    4.259 GHz                      (35.71%)
>      31,472,946,642      stalled-cycles-frontend:u #    1.72% frontend cycles idle     (35.71%)
>     254,440,746,368      stalled-cycles-backend:u  #   13.90% backend cycles idle      (35.71%)
>   4,117,921,862,700      instructions:u            #    2.25  insn per cycle
>                                                    #    0.06  stalled cycles per insn  (35.71%)
>     773,059,098,367      branches:u                # 1798.303 M/sec                    (35.71%)
>       2,775,345,450      branch-misses:u           #    0.36% of all branches          (35.71%)
>   2,329,109,097,524      L1-dcache-loads:u         # 5418.011 M/sec                    (35.71%)
>      24,907,172,614      L1-dcache-load-misses:u   #    1.07% of all L1-dcache accesses  (35.71%)
>     <not supported>      LLC-loads:u
>     <not supported>      LLC-load-misses:u
>     872,678,362,265      L1-icache-loads:u         # 2030.038 M/sec                    (35.71%)
>      15,221,564,231      L1-icache-load-misses:u   #    1.74% of all L1-icache accesses  (35.71%)
>      48,763,102,717      dTLB-loads:u              #  113.434 M/sec                    (35.71%)
>          75,459,133      dTLB-load-misses:u        #    0.15% of all dTLB cache accesses  (35.71%)
>       8,416,573,693      iTLB-loads:u              #   19.579 M/sec                    (35.72%)
>          20,650,906      iTLB-load-misses:u        #    0.25% of all iTLB cache accesses  (35.72%)
>
>       429.911532621 seconds time elapsed
>
>       428.611864000 seconds user
>         0.199257000 seconds sys
>
>
> Performance measurements without partial linking:
>
>          444,598.61 msec task-clock:u              #    1.000 CPUs utilized
>                   0      context-switches:u        #    0.000 K/sec
>                   0      cpu-migrations:u          #    0.000 K/sec
>             145,528      page-faults:u             #    0.327 K/sec
>   1,907,560,568,869      cycles:u                  #    4.291 GHz                      (35.71%)
>      24,156,412,003      stalled-cycles-frontend:u #    1.27% frontend cycles idle     (35.72%)
>     202,601,144,555      stalled-cycles-backend:u  #   10.62% backend cycles idle      (35.72%)
>   4,118,200,832,359      instructions:u            #    2.16  insn per cycle
>                                                    #    0.05  stalled cycles per insn  (35.72%)
>     773,117,144,029      branches:u                # 1738.910 M/sec                    (35.72%)
>       2,727,637,567      branch-misses:u           #    0.35% of all branches          (35.71%)
>   2,326,960,449,159      L1-dcache-loads:u         # 5233.845 M/sec                    (35.71%)
>      26,778,818,764      L1-dcache-load-misses:u   #    1.15% of all L1-dcache accesses  (35.71%)
>     <not supported>      LLC-loads:u
>     <not supported>      LLC-load-misses:u
>     903,186,314,629      L1-icache-loads:u         # 2031.465 M/sec                    (35.71%)
>      25,017,115,665      L1-icache-load-misses:u   #    2.77% of all L1-icache accesses  (35.71%)
>      50,448,039,415      dTLB-loads:u              #  113.469 M/sec                    (35.71%)
>          78,186,127      dTLB-load-misses:u        #    0.15% of all dTLB cache accesses  (35.71%)
>       9,419,644,114      iTLB-loads:u              #   21.187 M/sec                    (35.71%)
>           1,479,281      iTLB-load-misses:u        #    0.02% of all iTLB cache accesses  (35.71%)
>
>       444.623341115 seconds time elapsed
>
>       443.313786000 seconds user
>         0.256109000 seconds sys
>
> On Sat, Feb 6, 2021 at 5:20 AM Gabe Black <gabe.bl...@gmail.com> wrote:
>
>> Out of curiosity I tried a quick x86 boot test, and saw that the run time
>> with partial linking removed increased from just under 7 minutes to about 7
>> and a half minutes.
>>
>> I thought about this for a while since at first I had no idea why that
>> might happen. The theory I came up with is that with partial linking,
>> related bits of the simulator get grouped together, since they're generally
>> in the same directory, and those groups then likely end up in the same part
>> of the final binary. If those things are related, you get better locality
>> as far as TLB behavior and paging things in are concerned. gem5 is such a
>> big executable that I doubt locality at that scale makes much of a
>> difference at the granularity of cache lines. It's also possible that the
>> relocations between those entities could be more efficient if the offsets
>> they need to encode are smaller.
>>
>> If that's true, I've thought of two ways we could get that sort of
>> behavior back without reintroducing partial linking, both of which use
>> attributes gcc provides (and which I assume clang supports too).
>>
>> 1. The "hot" and "cold" attributes. "hot" makes the compiler optimize a
>> function particularly aggressively for performance, and "cold" makes it
>> optimize for size. According to the docs, both could (probably do?) put the
>> functions in question into separate sections where they would have better
>> locality, and the "cold" functions would stay out of the way.
>>
>> 2. Put things in different sections explicitly with the "section"
>> attribute. This would let us group items we want to show up near each
>> other, much as partial linking did, whether intentionally or accidentally.
>> (There's a rough sketch of both attributes below.)
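>>
>> As a rough illustration, here's a minimal sketch of what those attributes
>> might look like. The function names and the section name are made up, and
>> the syntax is gcc's attribute syntax, which clang should accept too:
>>
>>     // Hypothetical hot path: optimized aggressively. With
>>     // -freorder-functions (enabled at -O2), gcc also groups hot
>>     // functions into a .text.hot subsection so they end up near
>>     // each other.
>>     __attribute__((hot))
>>     void simulateTick()
>>     {
>>         // ... frequently executed work ...
>>     }
>>
>>     // Hypothetical cold path: optimized for size and placed in
>>     // .text.unlikely, keeping rarely run code out of the hot code's way.
>>     __attribute__((cold))
>>     void dumpDebugState()
>>     {
>>         // ... rarely executed work ...
>>     }
>>
>>     // Explicit grouping: put related functions into one named section
>>     // so the linker keeps them adjacent, much like partial linking did
>>     // for files from the same directory.
>>     __attribute__((section(".text.gem5_mem")))
>>     void handleMemRequest()
>>     {
>>         // ... memory system work ...
>>     }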
>>
>> A third option might be to use profile-guided optimization. I don't know
>> exactly how to get gcc or clang to do that or what it requires, but I
>> think they at least *can* do something along those lines. That would
>> hopefully give the compiler enough information to figure some of these
>> things out on its own.
>>
>> The problem with that option might be that things we don't exercise
>> during profiling (devices, CPUs, or features that aren't used) would look
>> unimportant, but could be very important if the simulator were configured
>> differently.
>>
>> One other thing we might want to try, though I'm not sure how it would
>> work, is to get gem5 loaded with a larger page size somehow. Given how big
>> the binary is, reducing TLB pressure that way would probably make a fairly
>> big difference in performance.
>>
>> Gabe
>>
_______________________________________________
gem5-dev mailing list -- gem5-dev@gem5.org
To unsubscribe send an email to gem5-dev-le...@gem5.org
