Out of curiosity I tried a quick x86 boot test, and saw that the run time with partial linking removed increased from just under 7 minutes to about 7 and a half minutes.
I thought about this for a while since at first I had no idea why that might happen. The theory I came up with is that with partial linking, related bits of the simulator get grouped together, since they're generally in the same directory, and so they likely end up in the same part of the final binary. If those things are related, then you'd get better locality as far as TLB performance and maybe paging things in. gem5 is such a big executable that I doubt locality at that scale makes much of a difference at cache-line granularity. Possibly the relocations between those entities could also be more efficient if the offsets they need to encode are smaller?

If that's true, there are two ways I've thought of where we could get that sort of behavior back without reintroducing partial linking, both of which use attributes gcc provides and which I assume clang does too.

1. The "hot" and "cold" attributes. "hot" makes a function get optimized particularly aggressively for performance, and "cold" makes the compiler optimize it for size. According to the docs, both can (probably do?) put the items in question into separate sections, where they'd have better locality and the "cold" functions would stay out of the way.

2. The "section" attribute, which puts things into particular sections explicitly. This could deliberately group items we'd want to show up near each other, the way partial linking does accidentally.

A third option might be profile-guided optimization. I don't know exactly how to set that up with gcc or clang or what it requires, but I think they at least *can* do something along those lines. That would hopefully give the compiler enough information to figure some of this out on its own.
The problem with that option might be that things we don't exercise in the profiling run (devices, CPUs, or features that aren't used) would look unimportant, but could be very important with a different simulator configuration.

One other thing we might want to try, though I'm not sure how it would work, is getting gem5 loaded with a larger page size somehow. Given how big the binary is, reducing pressure on the TLB that way could make a fairly big difference in performance.

Gabe
_______________________________________________
gem5-dev mailing list -- [email protected]
To unsubscribe send an email to [email protected]
