Hi Stefane, I'm still catching up on other emails, but I'm often in favor of the 'simple' approach. I'm sure the LKML folks will have their own opinions, but the simple fixed number approach seems like the right compromise...perhaps with a tunable parameter depending on the architecture and stack size. Also, perhaps we could get more out of it by doing a software prefetch on the stack buffer...reducing the cold misses...Honestly, it would be ideal if we could get those structs to fit in a handful of registers or cache lines...
My 2 cents... Phil On Wed, 2006-06-14 at 07:23 -0700, Stephane Eranian wrote: > Hello, > > The current perfmon2 API allows applications to pass vectors of arguments to > certain calls, in particular to the 3 functions to read/write PMU registers. > This approach was chosen because it is very flexible and allows applications > to modify either multiple or a single register in one call. It is extensible > because there is no implicit knowledge of the actual number of registers. > > Before entering the actual system call, the argument vector must be copied > into a kernel buffer. This is required by convention for security and also > fault reasons. The famous copy_from_user() and copy_to_user() are invoked. > This must be done before interrupts are masked. > > Vectors can have different sizes based on the PMU model, the number of event > sets. Yet, the vector must be copied into a kernel-level buffer. Today, we > allocate the kernel-memory on demand based on the size of the vector. We use > kmalloc/kfree. Of course, to avoid any abuse, we limit the size of the > allocated region via a perfmon2 tunable in sysfs. By default, it is set > to a page. > > This implementation has worked fairly well, yet it costs some performance > because kmalloc/kfree are expensive (especially kfree). Also it may seem > overkill to malloc a page for small vectors. > > I have run some experiments lately and they verified that kmalloc/kfree and > copy to/from user account for a very large portion of the cost for calls > with multiple registers (I tried 4). For the copies it is hard to avoid > them. One thing we could do is to try and reduce the size of the structs. > Today, both pfarg_pmd_t and pfarg_pmc_t do have reserved field for future > extensions. It may be possible to reduce those a little bit. > > There are several ways to amortize or eliminate the kmalloc/kfree. First of > all, it is important to understand that multiple threads may call into a > particular context at any time. All they need is access to the file > descriptor. > > An alternative that I have explored is to start from the hypothesis that > most vectors are small. If they are small enough, we could avoid the > kmalloc/kfree by using a buffer allocated on the stack. One could say > if the vector is less than 8 elements, then use the stack buffer. If not, then > go down the expensive path of kmalloc/kfree. I tried this experiment and got > over 20% improvement for pfm_read_pmds(). I chose 8 as the threshold. The > downside of this approach is that kernel stack space is limited and we should > avoid allocating large buffers on it. The pfarg_pmd struct is about 176 bytes > whereas pfarg_pmc_t is about 48 bytes. With 8 elements we reach 1408 bytes and > this is true for all architectures including i386. I don't know the kernel > stack > size on i386 but I suspect it is a page (4kB). Of course, the stack buffer > could > be adjusted per object type and per-architecture. > > It is important to note that we cannot use a kernel buffer of one element and > simply > loop over the vector. Because the copy_from/copy_to must be done without > locks nor > interrupts mask. So one would have to copy, mask irq, perfmon call, unmask > irq, > copy and loop for the next element. > > Another approach that was suggested to me is to allocate on demand but not > kfree > systematically when the call terminates. In other words, we amortize the cost > of the allocation by keeping the buffer around for the next caller. To make > this work, we would have to decompose the spin_lock_irq*() into spin_*lock() > and local_irq_*able() to avoid the race condition. For the first caller the > buffer would be allocated to fit the size (up to a certain limit like today). > When the call terminates, the buffer is kept via a pointer in the perfmon > context. The next caller, would check the pointer and size, if the buffer > is big enough, copy_user could proceed directly, otherwise a new buffer would > be allocated. that would also work. Yet I can see one issue with this approach > as some malicious user could create lots of contexts and make one call for > each to max out the argument vector limit for each. If you have 1024 > descriptors > and the limit is 1 page/context, it could allocate 1024 kernel pages > (non-pageable) > for nothing. Today we do not have a global argument vector size limit. Adding > one > would be costly because multiple threads could potentially contend for it and > therefore we would need yet another lock. > > I do not see another approach at this point. > > Does someone has something else to propose? > > If not, what is your opinion of the two approaches above? > > Thanks. _______________________________________________ perfmon mailing list [email protected] http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/
