RE: [perfmon] Cell port for Perfmon

Dan Terpstra Wed, 28 Mar 2007 11:35:54 -0800

Oh well...
4 32-bit counters virtualized to 64-bit isn't much worse than Opteron, and
still better than Intel Core...
Thanks for the detailed explanation, Carl.
- d


> -----Original Message-----
> From: Carl Love [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 28, 2007 1:53 PM
> To: William Cohen
> Cc: Dan Terpstra; [EMAIL PROTECTED]
> Subject: Re: [perfmon] Cell port for Perfmon
> 
> Dan and Will:
> 
> Yes using the trace buffer makes the reading of the performance counters
> much more expensive.  The trace buffer is 128 bits wide.  So you can
> store all of the counters in a single entry in the buffer.  When you go
> through the buffer, you would need to extract each counter from its bit
> range in the entry.
> 
> There is no hardware interrupt to let you know that the buffer is full,
> i.e. there is no "trace buffer service interrupt" as Will called it.
> So, when the trace buffer is full, the counters effectively stop and you
> lose counts.  Now, to avoid that, you could have a kernel timer that
> would periodically get called to accumulate the counters before the
> trace buffer fills.  So, for argument's sake, let's say we did this.  Well,
> for 16 bit counters, we would need to configure the hardware to store the
> performance counter counts to the trace buffer every 2^16 cycles just to
> make sure we save the value before it rolls over.  This is done in hardware
> so there is no overhead here.  Well, we will need to kick off the kernel
> timer every 1024 * 2^16 cycles (about 67 million cycles) to flush the
> trace buffer so the counters don't "stop".  I don't know the exact cost
> of calling a kernel timer but I suspect it is not much cheaper than
> servicing a hardware interrupt.  The kernel timer routine will then have
> to do 2048 hardware register reads to empty the trace buffer.  Note, you
> have to do two 64 bit reads to read the entire 128 bit trace buffer
> entry.  You must mask and shift to extract the 16 bit counters, then add
> each 16 bit count to its virtual count.  The point being the kernel timer
> function is not cheap.  Furthermore, you must call the kernel timer
> every 67 million cycles whether the counters are full or not because you
> have no way to tell if they are full or not.  At least if you are using
> interrupts, the interrupt will only get called when it is really needed.
> The interrupt handler would only need to do 8 reads and 8 adds to
> accumulate the counts.
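
The cost described in the paragraph above can be modelled in a few lines. This is only a sketch under stated assumptions: the position of each counter inside the 128 bit entry, and every name used here (unpack_entry, flush_trace_buffer, the constants), are hypothetical and not taken from the Cell hardware documentation or from perfmon.

```python
# Illustrative model only: the bit layout of a trace entry is an assumption.
COUNTER_BITS = 16
COUNTERS_PER_ENTRY = 8          # 8 x 16 bits = one 128-bit trace entry
TRACE_ENTRIES = 1024            # entries before the buffer fills
MASK16 = (1 << COUNTER_BITS) - 1
MASK64 = (1 << 64) - 1


def unpack_entry(lo_word, hi_word):
    """Split one 128-bit entry, read as two 64-bit words, into 8 counts.

    Assumed layout: counter i occupies bits [16*i, 16*i + 16) of the
    128-bit value, with lo_word holding bits 0..63.
    """
    entry = (hi_word << 64) | lo_word
    return [(entry >> (COUNTER_BITS * i)) & MASK16
            for i in range(COUNTERS_PER_ENTRY)]


def flush_trace_buffer(entries, virtual_counts):
    """Add every buffered 16-bit count to its 64-bit virtual counter.

    `entries` is a list of (lo_word, hi_word) pairs standing in for the
    two hardware register reads needed per entry. Returns the number of
    64-bit reads performed.
    """
    reads = 0
    for lo_word, hi_word in entries:
        reads += 2
        for i, count in enumerate(unpack_entry(lo_word, hi_word)):
            virtual_counts[i] = (virtual_counts[i] + count) & MASK64
    return reads


# Timer period needed so that a full buffer is always drained in time.
FLUSH_PERIOD_CYCLES = TRACE_ENTRIES * (1 << COUNTER_BITS)
```

Draining a full buffer of 1024 entries with flush_trace_buffer performs 2048 reads, and FLUSH_PERIOD_CYCLES works out to 1024 * 2^16 cycles, matching the figures quoted above.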
> 
> There is another issue with using the trace buffer to implement virtual
> counters.  Perfmon needs to be able to do sampling.  You need to be able
> to call the overflow routine after N events so a sample can be stored.
> Well, if we use the trace buffer, we do not have an interrupt mechanism
> to tell us when we have seen N events.  We would have to accumulate up
> the counts at a fairly fine resolution so we could check to see if there
> have been N events.  Effectively, sampling in perfmon would not be
> possible if we used the trace buffers for virtual counters.
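
To illustrate the sampling problem, here is a rough sketch of how late a threshold crossing would be noticed if the counts only become visible when the buffer is flushed. The function name and the idea of measuring lateness in store intervals are assumptions made for this example; nothing here is perfmon code.

```python
def sample_latency(interval_counts, threshold):
    """Return how many store intervals pass between the moment the
    running total reaches `threshold` and the end of the buffer, which
    is the earliest point a timer-driven flush could notice it.

    `interval_counts` holds the per-interval counts for one counter, in
    the order the hardware stored them. Returns None if the threshold
    is not reached within this buffer.
    """
    running = 0
    for index, count in enumerate(interval_counts):
        running += count
        if running >= threshold:
            # Crossing happened at `index`; the flush only sees it
            # after the last entry has been written.
            return len(interval_counts) - 1 - index
    return None
```

For example, with 1024 intervals of 10 events each and a threshold of 100, the crossing happens in the tenth interval but is only seen 1014 intervals later. With one interval per 2^16 cycles, that is tens of millions of cycles after the event that should have triggered the sample.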
> 
> There are some additional issues such as breaking the perf count
> histogram functionality in the existing Cell performance counter tool.
> The bottom line is we do not feel that using the trace buffer to
> implement virtual counters in perfmon is a practical, low overhead
> solution.  We are not going to pursue this approach.  Our take is that
> the perfmon interface will expose the counters and allow the user level
> performance tools to configure the counters as needed.  The
> documentation will tell the user level tools that they must provide the
> intelligence to configure the counters as 32 bit counters for events
> such as cycles and inst retired where the interrupt overhead would be
> excessive for the virtual counters.  Only in cases where the count
> frequency is not really high should the user tool opt to use 16 bit
> counters.  It is up to the user tool to make that decision.
> 
> I worked on OProfile for CELL.  In that implementation, I looked at the
> difficulty of adding support to the OProfile user tools to support 16
> and 32 bit counters.  I looked at how often people profile on multiple
> events.  Given that you can only profile on events in the same group at
> a given time, I found very few cases where I could come up with more
> than 4 events that would even be interesting to profile on.  There
> simply wasn't a compelling enough argument for making OProfile work
> with 16/32 bit counters to justify the effort.  I went with four 32 bit
> counters for CELL OProfile.
> 
>                    Carl Love
> 
> On Wed, 2007-03-28 at 09:45 -0400, William Cohen wrote:
> > Dan Terpstra wrote:
> > > Carl -
> > > Based on your description below it sounds like the trace buffer
> > > *does* make the counters wider, but at a cost. You reduce the
> > > interrupt frequency by a factor of 10^3 (or 2^10) and pay the price
> > > by summing the 1024 values from the trace into a 64-bit virtual
> > > counter. 1024 adds is probably a lot more efficient than 1024
> > > interrupts. Consider adding 1023 '1's. The result is exactly 10 bits
> > > wide. Consider adding 1023 '65535's. The result is exactly 26 bits
> > > wide. 10 extra bits of dynamic range. And 10^3 fewer interrupts.
> > > You're right that sampling would still be restricted to the actual
> > > size of the physical counter, but that's the same restriction as
> > > before. Seems to me this could make virtualization of 16 bit
> > > counters *less* expensive.
> > > I'm probably missing other hardware details that make this approach
> > > impractical, but on the surface it could work.
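
The bit-width arithmetic in the paragraph above can be checked directly. This is just a quick sanity check; the variable names are made up for the example.

```python
def bits_needed(value):
    """Number of bits needed to represent a non-negative integer."""
    return value.bit_length()


ENTRIES = 1023
MAX_16BIT = (1 << 16) - 1           # 65535, a saturated 16-bit counter

smallest_sum = ENTRIES * 1          # every interval counted one event
largest_sum = ENTRIES * MAX_16BIT   # every interval saturated the counter

extra_bits = bits_needed(largest_sum) - 16
```

Here bits_needed(smallest_sum) is 10 and bits_needed(largest_sum) is 26, so summing the buffered values gains 10 bits of range over a bare 16 bit counter.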
> > >
> > > BTW, glad to hear about the debugger stuff.
> > >
> > > - dan
> >
> > Wouldn't this make the operation of reading the performance counter
> > more expensive? Currently, perfmon2 has to paste together the
> > accumulated values from interrupts and the current counter value, then
> > check that the value for interrupts hasn't rolled over because of the
> > non-atomic operation. With the trace buffer scheme the read would have
> > to scan through the buffer. This could still be less overhead than
> > taking all those interrupts. The code would have to be careful to make
> > sure that the scanning of the trace buffer is faster than the rate that
> > the hardware can put elements in the buffer. Is there just one buffer
> > shared between all the counters? If so, the trace buffer scan will need
> > to determine which counter the event is for. What happens to the
> > counter when the trace buffer service interrupt is triggered: can it
> > take more samples, or does the counter freeze? If it loses counts when
> > the buffer is filled that wouldn't be very useful.
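
The read scheme described in the paragraph above (combine the interrupt-accumulated part with the live counter, then re-check the accumulated part) might look roughly like this. It is a toy, single-threaded model with invented names, not perfmon2's actual code; in a real kernel the overflow count would be updated by an interrupt handler, which is what makes the retry necessary.

```python
class VirtualCounter:
    """Toy model of a narrow hardware counter extended in software."""

    def __init__(self, width_bits=16):
        self.width_bits = width_bits
        self.hw_value = 0        # current hardware counter contents
        self.overflows = 0       # overflow interrupts handled so far

    def count(self, events):
        """Advance the hardware counter, wrapping as the hardware would
        and recording each wrap as one handled overflow interrupt."""
        total = self.hw_value + events
        self.overflows += total >> self.width_bits
        self.hw_value = total & ((1 << self.width_bits) - 1)

    def read(self):
        """Combine both parts, retrying if an overflow raced the read."""
        while True:
            before = self.overflows
            low = self.hw_value
            if self.overflows == before:
                return (before << self.width_bits) | low
```

For example, after count(70000) on a fresh 16 bit VirtualCounter, read() returns 70000: one recorded overflow plus 4464 left in the hardware counter.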
> >
> > -Will
> > _______________________________________________
> > perfmon mailing list
> > [email protected]
> > http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/

_______________________________________________
perfmon mailing list
[email protected]
http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/
