Yeah, that wouldn't hurt :-). The size difference of the .text section,
which I assume is the most relevant for icache misses, is only a 27712 byte
*decrease*, from 0x16ab445 to 0x16a4805, so the size without partial
linking actually went down slightly. I think what may actually be happening
is that the layout of the text section is different in memory, and hot
items are aliasing in the cache and kicking each other out.

As for the pybind11 fix, I don't think that will actually change the
final binary size very much; the savings it will get us will be
primarily during the actual build. The way it is now, every file
that includes pybind11 directly or indirectly (which is a lot of them)
links against its own copy of the pybind11 "library". That means we have
hundreds of copies of the common pybind11 code being compiled into and left
in .o files across the build directory. Then when the final link happens, the
linker not only has to wade through all that and load all of it off disk,
it also has to find and consolidate all those extra copies so that we get
back down to exactly one instance of the common pybind11 machinery.

With Ciro's change, the common code (the code that doesn't need to be
templated) will be separated out and compiled exactly once, have exactly one
copy in the build directory, and be linked into the final binary exactly
once, with no extra copies to load, identify, and purge. The build time and
space savings should be substantial (and are, according to Ciro's earlier
measurements), but the final binary itself should end up *roughly* the
same size.

For reference, at the end of the scons fixup branch, the size of just the
build/X86/python/_m5 directory is 1.6GB (the worst offender),
while the final binary itself is only 222MB.

Gabe

On Wed, Feb 10, 2021 at 7:41 AM Jason Lowe-Power <[email protected]>
wrote:

> This is interesting! My takeaway: we need bigger i-caches to run gem5 ;).
>
> Is the binary about the same size for the two conditions? Could it be
> simply that the instruction working set is bigger when not using partial
> linking?
>
> That said, a 3% performance difference isn't a big problem, in my opinion.
> This analysis also gives us some interesting directions for future
> optimizations. I wonder how much the library version of pybind will help
> since that will significantly reduce the instruction footprint.
>
> Jason
>
> On Tue, Feb 9, 2021 at 10:56 PM Gabe Black via gem5-dev <[email protected]>
> wrote:
>
>> I did some measurements before and after, and I noticed a few things.
>> First, the iTLB-load-misses stat drops from 0.25% all the way down to
>> 0.02%. The frontend and backend stall cycles went down from 1.72% => 1.27%
>> and 13.90% => 10.62% respectively. The L1-icache-load-misses went *up* from
>> 1.74% => 2.77%.
>>
>> So it looks like performance is generally about the same or a little
>> better in most metrics, but for some reason icache hit rate drops.
>>
>> Performance measurements with partial linking:
>>
>>         429,882.68 msec task-clock:u              #    1.000 CPUs utilized
>>                  0      context-switches:u        #    0.000 K/sec
>>                  0      cpu-migrations:u          #    0.000 K/sec
>>            145,986      page-faults:u             #    0.340 K/sec
>>  1,830,956,683,109      cycles:u                  #    4.259 GHz                      (35.71%)
>>     31,472,946,642      stalled-cycles-frontend:u #    1.72% frontend cycles idle     (35.71%)
>>    254,440,746,368      stalled-cycles-backend:u  #   13.90% backend cycles idle      (35.71%)
>>  4,117,921,862,700      instructions:u            #    2.25  insn per cycle
>>                                                   #    0.06  stalled cycles per insn  (35.71%)
>>    773,059,098,367      branches:u                # 1798.303 M/sec                    (35.71%)
>>      2,775,345,450      branch-misses:u           #    0.36% of all branches          (35.71%)
>>  2,329,109,097,524      L1-dcache-loads:u         # 5418.011 M/sec                    (35.71%)
>>     24,907,172,614      L1-dcache-load-misses:u   #    1.07% of all L1-dcache accesses  (35.71%)
>>    <not supported>      LLC-loads:u
>>    <not supported>      LLC-load-misses:u
>>    872,678,362,265      L1-icache-loads:u         # 2030.038 M/sec                    (35.71%)
>>     15,221,564,231      L1-icache-load-misses:u   #    1.74% of all L1-icache accesses  (35.71%)
>>     48,763,102,717      dTLB-loads:u              #  113.434 M/sec                    (35.71%)
>>         75,459,133      dTLB-load-misses:u        #    0.15% of all dTLB cache accesses  (35.71%)
>>      8,416,573,693      iTLB-loads:u              #   19.579 M/sec                    (35.72%)
>>         20,650,906      iTLB-load-misses:u        #    0.25% of all iTLB cache accesses  (35.72%)
>>
>>      429.911532621 seconds time elapsed
>>
>>      428.611864000 seconds user
>>        0.199257000 seconds sys
>>
>>
>> Performance measurements without partial linking:
>>
>>         444,598.61 msec task-clock:u              #    1.000 CPUs utilized
>>                  0      context-switches:u        #    0.000 K/sec
>>                  0      cpu-migrations:u          #    0.000 K/sec
>>            145,528      page-faults:u             #    0.327 K/sec
>>  1,907,560,568,869      cycles:u                  #    4.291 GHz                      (35.71%)
>>     24,156,412,003      stalled-cycles-frontend:u #    1.27% frontend cycles idle     (35.72%)
>>    202,601,144,555      stalled-cycles-backend:u  #   10.62% backend cycles idle      (35.72%)
>>  4,118,200,832,359      instructions:u            #    2.16  insn per cycle
>>                                                   #    0.05  stalled cycles per insn  (35.72%)
>>    773,117,144,029      branches:u                # 1738.910 M/sec                    (35.72%)
>>      2,727,637,567      branch-misses:u           #    0.35% of all branches          (35.71%)
>>  2,326,960,449,159      L1-dcache-loads:u         # 5233.845 M/sec                    (35.71%)
>>     26,778,818,764      L1-dcache-load-misses:u   #    1.15% of all L1-dcache accesses  (35.71%)
>>    <not supported>      LLC-loads:u
>>    <not supported>      LLC-load-misses:u
>>    903,186,314,629      L1-icache-loads:u         # 2031.465 M/sec                    (35.71%)
>>     25,017,115,665      L1-icache-load-misses:u   #    2.77% of all L1-icache accesses  (35.71%)
>>     50,448,039,415      dTLB-loads:u              #  113.469 M/sec                    (35.71%)
>>         78,186,127      dTLB-load-misses:u        #    0.15% of all dTLB cache accesses  (35.71%)
>>      9,419,644,114      iTLB-loads:u              #   21.187 M/sec                    (35.71%)
>>          1,479,281      iTLB-load-misses:u        #    0.02% of all iTLB cache accesses  (35.71%)
>>
>>      444.623341115 seconds time elapsed
>>
>>      443.313786000 seconds user
>>        0.256109000 seconds sys
>>
>> On Sat, Feb 6, 2021 at 5:20 AM Gabe Black <[email protected]> wrote:
>>
>>> Out of curiosity I tried a quick x86 boot test, and saw that the run
>>> time with partial linking removed increased from just under 7 minutes to
>>> about 7 and a half minutes.
>>>
>>> I thought about this for a while since at first I had no idea why that
>>> might happen, and the theory I came up with is that with partial linking,
>>> related bits of the simulator are grouped together, since they're generally
>>> in the same directory, and those will then likely end up in the same part
>>> of the final binary. If those things are related, you'll get better
>>> locality as far as TLB performance and maybe paging things in. gem5 is such
>>> a big executable that I doubt locality at that scale would make much of a
>>> difference at the granularity of cache lines. Also, the relocations
>>> between those entities could possibly be more efficient if the offsets
>>> they need to encode are smaller.
>>>
>>> If that's true, there are two ways I've thought of that we could get
>>> that sort of behavior back without reintroducing partial linking, both of
>>> which use attributes gcc provides (and which I assume clang supports too).
>>>
>>> 1. The "hot" and "cold" attributes. "hot" makes a function get optimized
>>> particularly aggressively for performance, and "cold" makes the compiler
>>> optimize for size. According to the docs, both could (probably do?) put the
>>> items in question into separate sections where they would have better
>>> locality, and the "cold" functions would stay out of the way.
>>>
>>> 2. Put things in different sections explicitly with the "section"
>>> attribute. This could explicitly group items we want to show up near each
>>> other, like what partial linking does implicitly/accidentally.
>>>
>>> A third option might be to use profile-guided optimization. I don't
>>> know how to get gcc or clang to use that or what it requires, but I think
>>> they at least *can* do something along those lines. That would hopefully
>>> give the compiler enough information that it could figure some of these
>>> things out on its own.
>>>
>>> The problem with this option might be that things we don't exercise in
>>> the profiling (devices or CPUs or features that aren't used) may look
>>> unimportant, but would be very important if the configuration of the
>>> simulator was different.
>>>
>>> One other thing we might want to try, though I'm not sure how it would
>>> work, is to get gem5 loaded in with a larger page size somehow. Given
>>> how big the binary is, reducing pressure on the TLB that way would probably
>>> make a fairly big difference in performance.
>>>
>>> Gabe
>>>
>> _______________________________________________
>> gem5-dev mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
>
>
