Oh well... four 32-bit counters virtualized to 64-bit isn't much worse than
Opteron, and still better than Intel Core...
Thanks for the detailed explanation, Carl.
- d

> -----Original Message-----
> From: Carl Love [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 28, 2007 1:53 PM
> To: William Cohen
> Cc: Dan Terpstra; [EMAIL PROTECTED]
> Subject: Re: [perfmon] Cell port for Perfmon
>
> Dan and Will:
>
> Yes, using the trace buffer makes the reading of the performance counters
> much more expensive. The trace buffer is 128 bits wide, so you can
> store all of the counters in a single entry in the buffer. When you go
> through the buffer, you would need to extract each counter from its bit
> range in the entry.
>
> There is no hardware interrupt to let you know that the buffer is full,
> i.e. there is no "trace buffer service interrupt" as Will called it.
> So, when the trace buffer is full, the counters effectively stop and you
> lose counts. Now, to avoid that, you could have a kernel timer that
> would periodically get called to accumulate the counters before the
> trace buffer fills. So, for argument's sake, let's say we did this. Well,
> for 16-bit counters, we would need to configure the hardware to store the
> performance counter counts to the trace buffer every 2^16 cycles just to
> make sure we save the value before it rolls over. This is done in hardware,
> so there is no overhead here. Well, we will need to kick off the kernel
> timer every 1024 * 2^16 cycles (about 67 million cycles) to flush the
> trace buffer so the counters don't "stop". I don't know the exact cost
> of calling a kernel timer, but I suspect it is not much cheaper than
> servicing a hardware interrupt. The kernel timer routine will then have
> to do 2048 hardware register reads to empty the trace buffer. Note, you
> have to do two 64-bit reads to read the entire 128-bit trace buffer
> entry. You must mask and shift to extract the 16-bit counters, then add each
> 16-bit count to the virtual count. The point being, the kernel timer
> function is not cheap.
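
A rough sketch of the bookkeeping Carl describes up to this point, in Python purely for the arithmetic. The sizes (128-bit entries, 16-bit counters, 1024 entries, two 64-bit reads per entry) come from his message; the function names, the bit layout with counter 0 in the low 16 bits, and the assumption that each entry holds the count since the previous store are mine and may not match the hardware.

```python
# Model of the trace-buffer drain described above. Sizes are from the message;
# the bit layout (counter 0 in the low 16 bits) is an assumption.
ENTRY_BITS = 128        # width of one trace buffer entry
COUNTER_BITS = 16       # width of each physical counter
TRACE_ENTRIES = 1024    # entries before the buffer fills
COUNTERS_PER_ENTRY = ENTRY_BITS // COUNTER_BITS   # 8 counters per entry
COUNTER_MASK = (1 << COUNTER_BITS) - 1
U64_MASK = (1 << 64) - 1


def flush_period_cycles():
    """Cycles between timer flushes: one entry is stored every 2^16 cycles."""
    return TRACE_ENTRIES * (1 << COUNTER_BITS)


def reads_per_flush():
    """Each 128-bit entry takes two 64-bit register reads."""
    return TRACE_ENTRIES * (ENTRY_BITS // 64)


def unpack_entry(lo64, hi64):
    """Split one entry, read as two 64-bit halves, into eight 16-bit counts."""
    entry = (hi64 << 64) | lo64
    return [(entry >> (i * COUNTER_BITS)) & COUNTER_MASK
            for i in range(COUNTERS_PER_ENTRY)]


def drain(entries, virtual):
    """Mask, shift, and add every buffered count into 64-bit virtual counts."""
    for lo64, hi64 in entries:
        for i, count in enumerate(unpack_entry(lo64, hi64)):
            virtual[i] = (virtual[i] + count) & U64_MASK
    return virtual
```

Here flush_period_cycles() comes out to 67108864, the "about 67 million cycles" figure, and reads_per_flush() to the 2048 register reads, each followed by the mask/shift/add work in drain().
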
> Furthermore, you must call the kernel timer
> every 67 million cycles whether the counters are full or not, because you
> have no way to tell if they are full or not. At least if you are using
> interrupts, the interrupt will only get called when it is really needed.
> The interrupt handler would only need to do 8 reads and 8 adds to
> accumulate the counts.
>
> There is another issue with using the trace buffer to implement virtual
> counters. Perfmon needs to be able to do sampling. You need to be able
> to call the overflow routine after N events so a sample can be stored.
> Well, if we use the trace buffer, we do not have an interrupt mechanism
> to tell us when we have seen N events. We would have to accumulate
> the counts at a fairly fine resolution so we could check to see if there
> have been N events. Effectively, sampling in perfmon would not be
> possible if we used the trace buffers for virtual counters.
>
> There are some additional issues, such as breaking the perf count
> histogram functionality in the existing Cell performance counter tool.
> The bottom line is we do not feel that using the trace buffer to
> implement virtual counters in perfmon is a practical, low-overhead
> solution. We are not going to pursue this approach. Our take is that
> the perfmon interface will expose the counters and allow the user-level
> performance tools to configure the counters as needed. The
> documentation will tell the user-level tools that they must provide the
> intelligence to configure the counters as 32-bit counters for events
> such as cycles and inst retired, where the interrupt overhead would be
> excessive for the virtual counters. Only in cases where the count
> frequency is not really high should the user tool opt to use 16-bit
> counters. It is up to the user tool to make that decision.
>
> I worked on OProfile for CELL.
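
To put rough numbers on the 16-bit versus 32-bit choice left to the user tools, here is a small Python estimate of overflow-interrupt rates. The 3.2 GHz clock, the 1000-interrupts-per-second threshold, and the function names are all assumptions of mine for illustration; nothing here is taken from perfmon or the Cell documentation.

```python
# Estimate how often a counter of a given width overflows.
# CLOCK_HZ is an assumed clock rate, not a figure from this thread.
CLOCK_HZ = 3_200_000_000


def overflow_interrupts_per_second(events_per_second, counter_bits):
    """A counter of the given width overflows once every 2^bits events."""
    return events_per_second / (1 << counter_bits)


def pick_counter_bits(events_per_second, max_interrupts_per_second=1000):
    """Hypothetical tool policy: use 16 bits only if overflows stay rare."""
    rate_16 = overflow_interrupts_per_second(events_per_second, 16)
    return 16 if rate_16 <= max_interrupts_per_second else 32
```

Under these assumptions, counting cycles in a 16-bit counter would overflow about 48,828 times a second, while a 32-bit counter overflows less than once a second, which is the sort of gap that makes 32-bit the sensible pick for cycles and inst retired.
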
> In that implementation, I looked at the
> difficulty of adding support to the OProfile user tools for 16-
> and 32-bit counters. I looked at how often people profile on multiple
> events. Given that you can only profile on events in the same group at
> a given time, I found very few cases where I could come up with more
> than 4 events that would even be interesting to profile on. There
> simply wasn't a compelling enough argument for making OProfile work
> with 16/32-bit counters to justify the effort. I went with four 32-bit
> counters for CELL OProfile.
>
> Carl Love
>
> On Wed, 2007-03-28 at 09:45 -0400, William Cohen wrote:
> > Dan Terpstra wrote:
> > > Carl -
> > > Based on your description below it sounds like the trace buffer *does* make
> > > the counters wider, but at a cost. You reduce the interrupt frequency by a
> > > factor of 10^3 (or 2^10) and pay the price by summing the 1024 values from
> > > the trace into a 64-bit virtual counter. 1024 adds is probably a lot more
> > > efficient than 1024 interrupts. Consider adding 1023 '1's. The result is
> > > exactly 10 bits wide. Consider adding 1023 '65535's. The result is exactly
> > > 26 bits wide. 10 extra bits of dynamic range. And 10^3 fewer interrupts.
> > > You're right that sampling would still be restricted to the actual size of
> > > the physical counter, but that's the same restriction as before. Seems to me
> > > this could make virtualization of 16-bit counters *less* expensive.
> > > I'm probably missing other hardware details that make this approach
> > > impractical, but on the surface it could work.
> > >
> > > BTW, glad to hear about the debugger stuff.
> > >
> > > - dan
> >
> > Wouldn't this make the operation of reading the performance counter more
> > expensive?
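
The dynamic-range figures in Dan's quoted message (10 bits for 1023 ones, 26 bits for 1023 full 16-bit values) are easy to check. A minimal Python sketch; the function name is mine.

```python
def summed_width_bits(n_values, counter_bits):
    """Bits needed to hold the largest possible sum of n_values counters."""
    max_sum = n_values * ((1 << counter_bits) - 1)
    return max_sum.bit_length()
```

summed_width_bits(1023, 16) gives 26, so summing a full buffer of 16-bit counts adds 10 bits over a single counter, as stated.
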
> > Currently, perfmon2 has to paste together the accumulated values from
> > interrupts and the current counter value, then check that the value for
> > interrupts hasn't rolled over because of the non-atomic operation. With the
> > trace buffer scheme the read would have to scan through the buffer. This could
> > still be less overhead than taking all those interrupts. The code would have to
> > be careful to make sure that the scanning of the trace buffer is faster than the
> > rate that the hardware can put elements in the buffer. Is there just one
> > buffer shared between all the counters? If so, the trace buffer scan will need
> > to determine which counter the event is for. What happens to the counter when
> > the trace buffer service interrupt is triggered: can it take more samples or
> > does the counter freeze? If it loses counts when the buffer is filled, that
> > wouldn't be very useful.
> >
> > -Will
> > _______________________________________________
> > perfmon mailing list
> > [email protected]
> > http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/
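
For reference, the "paste together and check for rollover" read that Will describes can be sketched as a re-read loop. This is one common way to do such a check, written in Python as a model only; it is not taken from the perfmon2 source, and read_soft_high and read_hw_counter are hypothetical stand-ins for the interrupt-accumulated overflow count and the hardware register read.

```python
def read_virtual_counter(read_soft_high, read_hw_counter, counter_bits=32):
    """Combine the accumulated overflow count with the live hardware count.

    read_soft_high and read_hw_counter are hypothetical callables; the two
    reads are not atomic, so the overflow count is sampled before and after.
    """
    while True:
        high_before = read_soft_high()
        low = read_hw_counter()
        high_after = read_soft_high()
        if high_before == high_after:
            # No overflow interrupt landed between the reads, so the
            # high and low parts belong to the same counter epoch.
            return (high_before << counter_bits) | low
        # Otherwise the counter wrapped mid-read; try again.
```

With four 32-bit counters the retry should be rare, since under an assumed multi-GHz clock even a cycle counter wraps only about once a second.
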
