> The penalty paid by hw_mfpr and hw_mtpr still exists, but it can
> be reduced by about a quarter by lowering the IPR access latency
> from 3 cycles (the default) to 1. I have further tried removing the
> IsIprAccess flag from hw_mfpr and hw_mtpr to see the overall
> benefit. canneal and blackscholes benefited the most (2-5%) because
> they spend the largest portion of CPU time waiting on serialized
> stalls, but the remaining PARSEC benchmarks did not receive any
> benefit (see sys_rename_stall_from_serialization for why canneal
> and blackscholes benefit).
>
I believe, as Nathan suggested, that the cause of these palcode
instructions is the rdunique/wrunique pair; if you look at the
palcode:

CALL_PAL_Wrunique:
        nop
        mfpr    r12, pt_pcbb            // get pcb pointer
        stq_p   r16, osfpcb_q_unique(r12) // get new value
        nop                             // Pad palshadow write
        hw_rei                          // back


You can see in decoder.isa that hw_rei is flagged IsSerializing and
IsSerializeBefore:

    0: OpcdecFault::hw_rei();
    format BasicOperate {
        1: hw_rei({{ xc->hwrei(); }}, IsSerializing, IsSerializeBefore);
    }

I think this might be why you only observed a 2-5% improvement:
hw_rei also needs to be fixed. I could be wrong. Can we remove those
flags to test whether it is safe?
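
Something like this is what I mean (untested; just the excerpt above
with the serializing flags dropped):

    format BasicOperate {
        1: hw_rei({{ xc->hwrei(); }});  // flags removed for the test
    }

If the kernel still boots and the benchmarks produce correct output,
that would tell us whether hw_rei really needs to drain the pipeline.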


Thanks,
Ehsan

On Mon, Oct 19, 2009 at 7:04 PM, Rick Strong <[email protected]> wrote:
> Sorry for the late reply on this subject. I figure I can chime in
> with some data points and graphs. I have included the average IPC
> per core on a subset of the PARSEC benchmarks for 2, 4, 8, 16, and
> 32 cores (this is for the first 10e9 instructions of the
> region-of-interest). The average IPC of blackscholes is about 0.91
> per core, which is much better than the results I first saw. The
> three biggest contributors to the original slowdown were hw_mfpr,
> hw_mtpr, and trapb. Removing the stall for trapb was a simple edit
> in decoder.isa that did not result in simulation problems. The
> penalty paid by hw_mfpr and hw_mtpr still exists, but it can be
> reduced by about a quarter by lowering the IPR access latency from
> 3 cycles (the default) to 1. I have further tried removing the
> IsIprAccess flag from hw_mfpr and hw_mtpr to see the overall
> benefit. canneal and blackscholes benefited the most (2-5%) because
> they spend the largest portion of CPU time waiting on serialized
> stalls, but the remaining PARSEC benchmarks did not receive any
> benefit (see sys_rename_stall_from_serialization for why canneal
> and blackscholes benefit).
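>
> For reference, the trapb change amounted to dropping the
> serializing flags from its decoder.isa entry, schematically (a
> hypothetical sketch of the edit, not the exact diff):
>
>     // before: trapb drains the ROB before anything else can issue
>     trapb({{ }}, IsSerializing, IsSerializeBefore);
>     // after: let it issue like an ordinary no-op
>     trapb({{ }});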
>
> The most significant factor so far in improving performance for me
> was using simlarge instead of simsmall. However, running all of
> simlarge is impractical in M5, as it takes a very long time. The
> other problem is that most of the analysis above is based on
> running a portion of the entire execution of the PARSEC benchmarks,
> which means that I am probably only capturing a few kernels of the
> larger overall program. The approach that makes the most sense in
> my mind is to improve the parallel benchmark simulation methodology
> and then to examine the sources of stalls in the CPU or memory
> subsystem. I am currently working on adding a
> systematically-uniform sampling policy that makes use of hundreds
> of simulation points and simulates only a subset of the total
> execution time. The idea is that I can then have some statistical
> guarantees that I am capturing the important behavior of a working
> set large enough to stress more than 8 cores, and that I am not
> falling victim to the performance penalties of the Linux scheduler
> performing load balancing or some other behavior. Once this is
> done, I would be happy to follow up on the IPC of the different
> benchmarks.
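>
> Mechanically, the sampling is just evenly spaced detailed windows
> across the run; a minimal sketch of the placement (hypothetical
> names, not M5 code):
>
>     #include <cstdint>
>     #include <vector>
>
>     // Pick k evenly spaced detailed-simulation windows over a run
>     // of total_insts instructions; everything in between is
>     // fast-forwarded in a cheaper CPU model.
>     std::vector<uint64_t> sampleStarts(uint64_t total_insts, unsigned k)
>     {
>         std::vector<uint64_t> starts;
>         if (k == 0)
>             return starts;
>         const uint64_t stride = total_insts / k;
>         for (unsigned i = 0; i < k; ++i)
>             starts.push_back(i * stride);
>         return starts;
>     }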
>
> -Rick
>
> ef wrote:
>>
>> Hello,
>>
>> Here is a bit of an update:
>>
>> I went ahead and tracked the number of hw_m*pr instructions using
>> blackscholes, simlarge, 2 cores: 5% of instructions fetched were
>> hw_m*pr instructions. I think this is a huge amount considering
>> that is 1 in 20 instructions; since I am using an 8-wide machine,
>> that works out to a serializing instruction roughly once every
>> three cycles (20 / 8 = 2.5), each stalling until the ROB is empty.
>> Not to mention the IPC is atrocious, which I suspect is caused by
>> these serializing stalls.
>>
>> I went ahead and objdumped my binaries and looked at decoder.isa.
>> It seems that the only instructions from the PARSEC binaries
>> generating hw_m*pr instructions are:
>>
>> rdunique (my code has lots of them), wrunique (my code has 2),
>> halt (not important), and callsys (quite a few).
>>
>> Anyway, a static count of these instructions only goes so far
>> given branching, loops, and jumps; I was just trying to get some
>> sort of quantification.
>>
>> Unfortunately I was unable to find much literature on these
>> instructions (rdunique, wrunique) using Google; I need to work on
>> that.
>>
>> I have 12 out of 13 PARSEC 2.1 benchmarks compiled, and I consider
>> them useless due to this issue. I am surprised that this issue has
>> never been brought up except by Rick Strong.
>>
>> My first thought was to try to force the compiler to
>> remove/reduce/optimize these instructions (rduniq, wruniq), but I
>> don't think that is possible, so a compiler solution seems to be
>> out of the question. There is a way to replace the rduniq and
>> wruniq calls; the following is from gcc.org:
>>
>> "The following builtins are available on systems that use the
>> OSF/1 PALcode. Normally they invoke the `rduniq' and `wruniq' PAL
>> calls, but when invoked with `-mtls-kernel', they invoke `rdval'
>> and `wrval'."
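>>
>> If I am reading the docs right, the builtins in question are
>> __builtin_thread_pointer and __builtin_set_thread_pointer, i.e.
>> the PAL calls come from thread-pointer accesses like:
>>
>>     /* On Alpha with OSF/1 PALcode these compile to the rduniq and
>>        wruniq PAL calls (or rdval/wrval with -mtls-kernel). */
>>     void *get_tp(void)   { return __builtin_thread_pointer(); }
>>     void set_tp(void *p) { __builtin_set_thread_pointer(p); }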
>>
>> I don't think rdval and wrval are implemented in M5, so that seems
>> useless.
>>
>> My next step, I guess, is to attempt to fix M5. Unfortunately my
>> C++ skill level is pretty low; I would like to implement a
>> solution within my abilities, but I don't think I have that
>> luxury. I am in the process of researching Nathan's solutions and
>> trying to implement them. I am assuming the scoreboard is the most
>> efficient and realistic implementation (is that correct?). I will
>> try looking into that solution first.
>>
>> Any thoughts, advice and/or suggestions would be greatly appreciated.
>>
>> Thanks,
>> EF
>>
>> On Mon, Oct 12, 2009 at 10:11 PM, nathan binkert <[email protected]> wrote:
>>
>>>
>>> My guess is that this is the result of calling rdunique and wrunique.
>>> These pal instructions keep track of the currently running thread.
>>> They more or less just access a single internal pal temp register.
>>> There are a number of things that could potentially be done to fix the
>>> slowness here.  You could create an actual renamed register in the o3
>>> model and make those palcalls access that special register.  If that
>>> weren't enough, you could add a more generalized facility for renaming
>>> pal temp registers (there are many that are simply treated as
>>> registers) and allow mfpr and mtpr to not be serializing.  Another
>>> option is to make some sort of "barrier" between pal instructions that
>>> allows them to not necessarily be serializing, but forces them to be
>>> executed in order.  You could take that a step further if necessary
>>> and implement a scoreboard that indicates which instructions have to
>>> wait for others (which is how the ev6 really does it).
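>>>
>>> Roughly what I mean by the scoreboard (just a sketch, not M5
>>> code): track which pal temp registers have a write in flight, and
>>> stall only the readers of a busy register instead of the whole
>>> pipeline.
>>>
>>>     #include <cstdint>
>>>
>>>     struct PalTempScoreboard {
>>>         uint64_t busy = 0;  // bit i set => pal temp reg i has a
>>>                             // write in flight
>>>
>>>         bool readReady(int reg) const { return !(busy & (1ULL << reg)); }
>>>         void startWrite(int reg)      { busy |= (1ULL << reg); }
>>>         void finishWrite(int reg)     { busy &= ~(1ULL << reg); }
>>>     };
>>>
>>> mfpr would check readReady() at issue time rather than
>>> serializing, and mtpr would bracket its execution with
>>> startWrite()/finishWrite().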
>>>
>>> None of these options are particularly simple, but they aren't overly
>>> massive changes either.
>>>
>>>  Nate
>>>
>>>
>>>>
>>>> Based on the results I am getting for the PARSEC benchmarks, the
>>>> OoO performance is really bad; there are too many hw_mfpr and
>>>> hw_mtpr instructions. I am trying to figure out why I am in the
>>>> PALcode so often (any ideas on how to figure this out?). I am
>>>> running blackscholes, which is a relatively simple PARSEC
>>>> benchmark.
>>>>
>>>> I need to do more research, but I don't think this is caused by
>>>> ITB or DTB misses at all (I made them really large just in
>>>> case). Right now blackscholes, which isn't close to finishing
>>>> (about 30%), is set to execute 1e9 instructions (running two
>>>> cores). From my Perl scripts it seems to have fetched 13 million
>>>> hw_mfpr and hw_mtpr instructions (fetched, not committed); that
>>>> is over 1% of all fetches. There is something really wrong with
>>>> that ratio.