Dan and Will:

Yes, using the trace buffer makes reading the performance counters much
more expensive.  The trace buffer is 128 bits wide, so you can store
all of the counters in a single entry in the buffer.  When you go
through the buffer, you have to extract each counter from its bit
range in the entry.
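
To make the mask-and-shift step concrete, here is a minimal sketch in C,
assuming eight 16 bit counters packed four per 64 bit word; the actual
bit layout of a Cell trace buffer entry may well differ:

#include <stdint.h>

#define COUNTERS_PER_ENTRY 8

/* Unpack one 128 bit trace buffer entry, read as two 64 bit words,
 * into eight 16 bit counter values.  The field ordering here is an
 * assumption for illustration only. */
static void unpack_trace_entry(uint64_t hi, uint64_t lo,
                               uint16_t counts[COUNTERS_PER_ENTRY])
{
        int i;

        for (i = 0; i < 4; i++) {
                counts[i]     = (uint16_t)(hi >> (48 - 16 * i));
                counts[i + 4] = (uint16_t)(lo >> (48 - 16 * i));
        }
}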

There is no hardware interrupt to let you know that the buffer is full,
i.e. there is no "trace buffer service interrupt" as Will called it.
So, when the trace buffer is full, the counters effectively stop and you
lose counts.  To avoid that, you could have a kernel timer that is
called periodically to accumulate the counters before the trace buffer
fills.  For argument's sake, let's say we did this.  For 16 bit
counters, we would need to configure the hardware to store the
performance counter counts to the trace buffer every 2^16 cycles just
to make sure we save each value before it rolls over.  That part is
done in hardware, so there is no overhead there.  But we would then
need to kick off the kernel timer every 1024 * 2^16 cycles (about 67
million cycles) to flush the trace buffer so the counters don't "stop".
I don't know the exact cost of calling a kernel timer, but I suspect it
is not much cheaper than servicing a hardware interrupt.  The kernel
timer routine then has to do 2048 hardware register reads to empty the
trace buffer (two 64 bit reads per 128 bit trace buffer entry), mask
and shift to extract the 16 bit counters, and add each 16 bit count to
its virtual count.  The point is that the kernel timer function is not
cheap.  Furthermore, you must call the kernel timer every 67 million
cycles whether the trace buffer is full or not, because you have no way
to tell.  At least with interrupts, the interrupt only fires when it is
really needed, and the interrupt handler would only need to do 8 reads
and 8 adds to accumulate the counts.
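
For comparison, here is a rough sketch of what that kernel timer
routine would have to do.  The trace buffer accessors are made-up names
for illustration, not real Cell or perfmon interfaces, and
unpack_trace_entry() is the sketch from above:

static uint64_t virt_count[COUNTERS_PER_ENTRY];

/* Hypothetical timer body: drain up to 1024 trace buffer entries
 * (two 64 bit reads each, 2048 reads total) and fold the 16 bit
 * hardware counts into 64 bit virtual counters. */
static void flush_trace_buffer(void)
{
        uint16_t counts[COUNTERS_PER_ENTRY];
        int i;

        while (tb_entries_pending()) {             /* assumed accessor   */
                uint64_t hi = read_tb_entry_hi();  /* first 64 bit read  */
                uint64_t lo = read_tb_entry_lo();  /* second 64 bit read */

                unpack_trace_entry(hi, lo, counts);
                for (i = 0; i < COUNTERS_PER_ENTRY; i++)
                        virt_count[i] += counts[i];
        }
}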

There is another issue with using the trace buffer to implement virtual
counters.  Perfmon needs to be able to do sampling: you need to be able
to call the overflow routine after N events so a sample can be stored.
If we use the trace buffer, we do not have an interrupt mechanism to
tell us when we have seen N events.  We would have to accumulate the
counts at a fairly fine resolution just to check whether there have
been N events.  Effectively, sampling in perfmon would not be possible
if we used the trace buffer for virtual counters.
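
To illustrate why, here is a sketch of the kind of polling check that
would be needed, reusing virt_count[] from the sketch above;
record_sample() and sample_period are placeholders, not perfmon code:

/* With no overflow interrupt, the only way to notice "N events have
 * occurred" is to poll the accumulated virtual count and compare it
 * against the sample period, so the sample is taken late by up to a
 * full polling interval. */
static void check_sample_threshold(int ctr, uint64_t sample_period)
{
        static uint64_t last_sample[COUNTERS_PER_ENTRY];

        if (virt_count[ctr] - last_sample[ctr] >= sample_period) {
                record_sample(ctr);  /* placeholder for the sampling hook */
                last_sample[ctr] += sample_period;
        }
}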

There are some additional issues, such as breaking the perf count
histogram functionality in the existing Cell performance counter tool.
The bottom line is that we do not feel using the trace buffer to
implement virtual counters in perfmon is a practical, low overhead
solution, and we are not going to pursue this approach.  Our take is
that the perfmon interface will expose the counters and allow the user
level performance tools to configure the counters as needed.  The
documentation will tell the user level tools that they must provide the
intelligence to configure the counters as 32 bit counters for events
such as cycles and instructions retired, where the interrupt overhead
would be excessive for the virtual counters.  Only in cases where the
count frequency is not very high should the user tool opt to use 16 bit
counters.  It is up to the user tool to make that decision.
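
As a sketch of the kind of decision that gets pushed up to the user
level tool, something like the following would do (the cutoff and the
names are invented for illustration, not part of the perfmon
interface):

enum ctr_width { CTR_WIDTH_16, CTR_WIDTH_32 };

/* Pick 32 bit counters for very high frequency events such as cycles
 * or instructions retired, 16 bit counters otherwise.  The threshold
 * is an arbitrary example. */
static enum ctr_width pick_counter_width(uint64_t expected_events_per_sec)
{
        return expected_events_per_sec > 100000000ULL ? CTR_WIDTH_32
                                                      : CTR_WIDTH_16;
}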

I worked on OProfile for CELL.  In that implementation, I looked at the
difficulty of adding support for 16 and 32 bit counters to the OProfile
user tools, and at how often people profile on multiple events.  Given
that you can only profile on events in the same group at a given time,
I found very few cases where I could come up with more than 4 events
that would even be interesting to profile on.  There simply wasn't a
compelling enough argument for making OProfile work with 16/32 bit
counters to justify the effort, so I went with four 32 bit counters for
CELL OProfile.

                   Carl Love

On Wed, 2007-03-28 at 09:45 -0400, William Cohen wrote:
> Dan Terpstra wrote:
> > Carl -
> > Based on your description below it sounds like the trace buffer *does* make
> > the counters wider, but at a cost. You reduce the interrupt frequency by a
> > factor of 10^3 (or 2^10) and pay the price by summing the 1024 values from
> > the trace into a 64-bit virtual counter. 1024 adds is probably a lot more
> > efficient than 1024 interrupts. Consider adding 1023 '1's. The result is
> > exactly 10 bits wide. Consider adding 1023 '65535's. The result is exactly
> > 26 bits wide. 10 extra bits of dynamic range. And 10^3 fewer interrupts.
> > You're right that sampling would still be restricted to the actual size of
> > the physical counter, but that's the same restriction as before. Seems to me
> > this could make virtualization of 16 bit counters *less* expensive.
> > I'm probably missing other hardware details that make this approach
> > impractical, but on the surface it could work.
> > 
> > BTW, glad to hear about the debugger stuff.
> > 
> > - dan
> 
> Wouldn't this make the operation of reading the performance counter
> more expensive? Currently, perfmon2 has to paste together the
> accumulated values from interrupts and the current counter value,
> then check that the value for interrupts hasn't rolled over because
> of the non-atomic operation. With the trace buffer scheme the read
> would have to scan through the buffer. This could still be less
> overhead than taking all those interrupts. The code would have to be
> careful to make sure that the scanning of the trace buffer is faster
> than the rate at which the hardware can put elements in the buffer.
> Is there just one buffer shared between all the counters? If so, the
> trace buffer scan will need to determine which counter the event is
> for. What happens to the counter when the trace buffer service
> interrupt is triggered: can it take more samples, or does the counter
> freeze? If it loses counts when the buffer is filled, that wouldn't
> be very useful.
> 
> -Will

_______________________________________________
perfmon mailing list
[email protected]
http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/
