Hello, Here is a bit of an update:
I went ahead and tracked the number of hw_m*pr instructions using Blackscholes, simlarge 2 cores. 5% of instructions fetched where hw_m*pr instructions. I think this is a huge amount considering that is 1 in 20 instructions, since I am using 8 wide machine, thats a stall of 3/5 of pipe stages one out of 3 cycles (stalling till the ROB is empty)...Not to mention the IPC is atrocious, which I suspect is caused by these serializing stalls. I went ahead and obj dumped my binaries, and looked at the decoder.isa. It seems that the only instructions from the parsec binaries generating hw_m*pr instructions is: rdunique (my code has lots of them) wruniqe(my code has 2) , halt(not important) and callsys(has quite a bit) Anyways the instruction count of these instructions is pointless due to branching loops and jumps. I was just trying to get some type of quantification. Unfortunately I was unable to find to much literature on these instructions (rdunique, wrunique) using google, I need to work on that. I have 12 out of 13 PARSEC 2.1 benchmarks compiled and I consider them useless due to this issue. I am surprised that this issue has never been brought except by Rick Strong. My first thought was try to force the compiler to remove/reduce/optimize these instruction (rduniq,wduniq), however I dont think that is possible, so compiler solution seems to be out of the question. There is a way to replace these rduniq an wduniq instructions, the following is from gcc.org " The following builtins are available on systems that use the OSF/1 PALcode. Normally they invoke the `rduniq' and `wruniq' PAL calls, but when invoked with `-mtls-kernel', they invoke `rdval' and `wrval'." I dont think rdval and wrval are implemented into M5 so that seems useless. My next step I guess, is to attempt to fix M5. Unfortunately my C++ skill level is pretty low, I would like to implement a solution based on what I can do, but I don't think I have that luxury. I am in the process of researching Nathan's solutions and trying to implement them. I am assuming the scoreboard is the most efficient implementation and realistic one (this correct?). I will try looking into that solution first.. Any thoughts, advice and/or suggestions would be greatly appreciated. Thanks, EF On Mon, Oct 12, 2009 at 10:11 PM, nathan binkert <[email protected]> wrote: > My guess is that this is the result of calling rdunique and wrunique. > These pal instructions keep track of the currently running thread. > They more or less just access a single internal pal temp register. > There are a number of things that could potentially be done to fix the > slowness here. You could create an actual renamed register in the o3 > model and make those palcalls access that special register. If that > weren't enough, you could add a more generalized facility for renaming > pal temp registers (there are many that are simply treated as > registers) and allow mfpr and mtpr to not be serializing. Another > option is to make some sort of "barrier" between pal instructions that > allows them to not necessarily be serializing, but forces them to be > executed in order. You could take that a step further if necessary > and implement a scoreboard that indicates which instructions have to > wait for others (which is how the ev6 really does it). > > None of these options are particularly simple, but they aren't overly > massive changes either. > > Nate > >> Based on the results I am getting, for PARSEC benchmarks, the OoO >> preformance is really bad, there are to many hw_mfpr and hw_mtpr, >> instructions. I am trying to figure out why I am in the PALcode so >> often (any ideas on how to figure this out?). I am running >> Blackscholes, which is a relatively simple PARSEC benchmark. >> >> I need to do more research, but I dont think this is caused at all for >> itb and dtb misses (i made them really large just in case). >> Right now for blackscholes, which isnt close to finish executing >> (about 30%) I have it so it will execute 1e9 instructions (running >> two cores), From my perl scripts it seems to have it fetched 13 >> million hw_mfpr and mtpr instructions (fetched, not committed). There >> is something really wrong with that ratio. > _______________________________________________ > m5-users mailing list > [email protected] > http://m5sim.org/cgi-bin/mailman/listinfo/m5-users > _______________________________________________ m5-users mailing list [email protected] http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
