Out of curiosity I tried a quick x86 boot test, and saw that the run time
with partial linking removed increased from just under 7 minutes to about 7
and a half minutes.

I thought about this for a while since at first I had no idea why that
might happen, and a theory I came up with was that with partial linking,
related bits of the simulator are grouped together since they're generally
in the same directory, and then those will likely end up in the same part
of the final binary. If those things are related, then you'll get better
locality as far as TLB performance and maybe paging things in. gem5 is such
a big executable that I doubt locality at that scale would make much of a
difference at the granularity of cache lines. Also, possibly the
relocations between those entities could be of a more efficient type if the
offset they need to encode is smaller?

If that's true, there are two ways I've thought of where we could get that
sort of behavior back without reintroducing partial linking, both of which
use attributes gcc provides which I assume clang would too.

1. The "hot" and "cold" attributes. "hot" makes a function get optimized
particularly aggressively for performance, and "cold" makes the compiler
optimize for size. According to the docs, both could (probably do?) put the
items in question into separate sections where they would have better
locality, and the "cold" functions would stay out of the way.

2. Put things in different sections explicitly with the "section"
attribute. This could explicitly group items we'd want to show up near each
other, like what partial linking does, whether on purpose or by accident.
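
To make the two attribute options above a bit more concrete, here is a
minimal sketch. All of the function names and the section name are made up
for illustration and are not from gem5. It assumes gcc or clang targeting
ELF (e.g. Linux). Where the hot/cold functions and the named section
actually land in the final binary still depends on the linker and its
script, so it would be worth checking the result with readelf or a link map.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical names throughout; none of these exist in gem5.

// "hot": asks the compiler to optimize this function more aggressively.
// On many targets GCC also places it in a hot text subsection so hot
// functions end up near each other.
__attribute__((hot)) int tickOnce(int cycles)
{
    return cycles + 1;
}

// "cold": optimized for size rather than speed, and on many targets
// placed in a separate text subsection, away from the hot code.
__attribute__((cold, noinline)) void reportFatal(const char *msg)
{
    std::fprintf(stderr, "fatal: %s\n", msg);
}

// "section": explicitly place related functions in one named section.
// The section name here is an assumption chosen for this sketch.
__attribute__((section("x86_decode_text"))) int decodeOne(int byte)
{
    assert(byte >= 0 && byte <= 0xff);
    return byte & 0x0f;         // low nibble
}

__attribute__((section("x86_decode_text"))) int decodeTwo(int byte)
{
    assert(byte >= 0 && byte <= 0xff);
    return (byte >> 4) & 0x0f;  // high nibble
}
```

The attributes only change optimization and placement, not behavior, so
they could be added to existing functions incrementally and measured with
the same boot test.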

A third option might be to use profile-guided optimization. I don't know
how to get gcc or clang to use that and what it requires, but I think they
at least *can* do something along those lines. That would hopefully give
the compiler enough information that it could figure some of these things
out on its own.

The problem with this option might be that things we don't exercise in the
profiling (devices or CPUs or features that aren't used) may look
unimportant, but would be very important if the configuration of the
simulator was different.

One other thing we might want to try, and I'm not sure how this would work,
is getting gem5 loaded with a larger page size somehow. Given how big the
binary is, reducing pressure on the TLB that way would probably make a
fairly big difference in performance.

Gabe
_______________________________________________
gem5-dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]