Hello,

The current perfmon2 API allows applications to pass vectors of arguments to certain calls, in particular to the three functions that read and write PMU registers. This approach was chosen because it is very flexible: an application can modify a single register or several registers in one call. It is also extensible because the interface has no implicit knowledge of the actual number of registers.

Before entering the actual system call, the argument vector must be copied into a kernel buffer. This is required by convention, both for security and because an access to user memory can fault. The well-known copy_from_user() and copy_to_user() are invoked for this, and the copy must be done before interrupts are masked. Vectors can have different sizes depending on the PMU model and the number of event sets, yet the whole vector must be copied into a kernel-level buffer. Today, we allocate the kernel memory on demand, based on the size of the vector, using kmalloc/kfree. Of course, to avoid any abuse, we limit the size of the allocated region via a perfmon2 tunable in sysfs. By default, it is set to one page.

This implementation has worked fairly well, yet it costs some performance because kmalloc/kfree are expensive (especially kfree). It may also seem overkill to allocate up to a page for small vectors. I have run some experiments lately, and they confirmed that kmalloc/kfree and the copies to/from user space account for a very large portion of the cost of calls with multiple registers (I tried 4).

The copies are hard to avoid. One thing we could do is try to reduce the size of the structs. Today, both pfarg_pmd_t and pfarg_pmc_t have reserved fields for future extensions. It may be possible to shrink those a little.

There are several ways to amortize or eliminate the kmalloc/kfree. First of all, it is important to understand that multiple threads may call into a particular context at any time. All they need is access to the file descriptor.

An alternative that I have explored starts from the hypothesis that most vectors are small. If they are small enough, we can avoid the kmalloc/kfree by using a buffer allocated on the stack. One could say: if the vector has fewer than 8 elements, use the stack buffer; if not, go down the expensive kmalloc/kfree path. I tried this experiment and got over 20% improvement for pfm_read_pmds(). I chose 8 as the threshold.
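
To make the stack-buffer idea concrete, here is a minimal user-space sketch of the fast path with a heap fallback. It is not the perfmon2 code: struct demo_pmd, demo_read_pmds() and demo_process() are hypothetical names, malloc/free/memcpy stand in for kmalloc/kfree and copy_from_user()/copy_to_user(), and I assume that a vector of exactly 8 elements still takes the stack path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for pfarg_pmd_t; only its ~176-byte size matters. */
struct demo_pmd {
    unsigned char raw[176];
};

#define DEMO_STACK_ELEMS 8  /* threshold used in the experiment */

enum demo_path { DEMO_PATH_STACK, DEMO_PATH_HEAP, DEMO_PATH_ERROR };

/* Placeholder for the real register access done on the kernel copy. */
static void demo_process(struct demo_pmd *vec, size_t count)
{
    for (size_t i = 0; i < count; i++)
        vec[i].raw[0] = (unsigned char)(i + 1);
}

/*
 * Copy the caller's vector into a private buffer, process it, copy it back.
 * max_bytes plays the role of the sysfs tunable capping the vector size.
 */
enum demo_path demo_read_pmds(struct demo_pmd *user_vec, size_t count,
                              size_t max_bytes)
{
    struct demo_pmd stack_buf[DEMO_STACK_ELEMS]; /* 8 * 176 = 1408 bytes */
    struct demo_pmd *buf = stack_buf;
    enum demo_path path = DEMO_PATH_STACK;
    size_t bytes;

    assert(user_vec != NULL);
    if (count == 0 || count > max_bytes / sizeof(*user_vec))
        return DEMO_PATH_ERROR;
    bytes = count * sizeof(*user_vec);

    if (count > DEMO_STACK_ELEMS) {
        /* Slow path: would be kmalloc() in the kernel. */
        buf = malloc(bytes);
        if (buf == NULL)
            return DEMO_PATH_ERROR;
        path = DEMO_PATH_HEAP;
    }

    memcpy(buf, user_vec, bytes);   /* copy_from_user(), before masking irqs */
    demo_process(buf, count);       /* lock + irq-masked section in the kernel */
    memcpy(user_vec, buf, bytes);   /* copy_to_user(), after unmasking irqs */

    if (buf != stack_buf)
        free(buf);                  /* would be kfree() in the kernel */
    return path;
}
```

A call with 4 elements never touches the allocator, while a call with 16 elements (2816 bytes) falls back to the heap as long as it fits under the limit.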

The downside of this approach is that kernel stack space is limited and we should avoid allocating large buffers on it. The pfarg_pmd_t struct is about 176 bytes whereas pfarg_pmc_t is about 48 bytes. With 8 pfarg_pmd_t elements we reach 1408 bytes (8 x 176), and this is true for all architectures including i386. I don't know the kernel stack size on i386 but I suspect it is a page (4 kB). Of course, the size of the stack buffer could be adjusted per object type and per architecture.

It is important to note that we cannot use a kernel buffer of one element and simply loop over the vector, because the copy_from_user()/copy_to_user() calls must be done with no lock held and with interrupts unmasked. So for each element one would have to copy in, mask interrupts, make the perfmon call, unmask interrupts, copy out, and then loop to the next element.

Another approach that was suggested to me is to allocate on demand but not kfree systematically when the call terminates. In other words, we amortize the cost of the allocation by keeping the buffer around for the next caller. To make this work, we would have to decompose the spin_lock_irq*() calls into spin_*lock() and local_irq_*able() to avoid the race condition. For the first caller, the buffer would be allocated to fit the vector size (up to a certain limit, like today). When the call terminates, the buffer is kept via a pointer in the perfmon context. The next caller would check the pointer and the size: if the buffer is big enough, the copy from user space can proceed directly; otherwise a new buffer is allocated. That would also work.

Yet I can see one issue with this approach: a malicious user could create lots of contexts and make one call on each with a vector that maxes out the argument size limit. If you have 1024 descriptors and the limit is 1 page/context, that pins 1024 non-pageable kernel pages for nothing. Today we do not have a global argument vector size limit. Adding one would be costly because multiple threads could potentially contend for it, and therefore we would need yet another lock. I do not see another approach at this point.
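
The keep-the-buffer variant could look roughly like the user-space sketch below. All names (struct demo_ctx, demo_get_argbuf(), demo_call(), demo_ctx_close(), DEMO_ARG_LIMIT) are hypothetical, malloc/free/memcpy again stand in for kmalloc/kfree and the user copies, and locking is left out entirely: in the kernel, the pointer/size check and the buffer hand-off would have to be protected against other threads calling into the same context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_ARG_LIMIT 4096  /* per-context cap; one page, like the default */

/* Hypothetical slice of a perfmon context: just the cached argument buffer. */
struct demo_ctx {
    void *arg_buf;          /* kept across calls, NULL until first use */
    size_t arg_size;        /* capacity of arg_buf in bytes */
    unsigned long allocs;   /* number of allocations, for illustration */
};

/* Return a buffer of at least `bytes`, reusing the cached one when it fits. */
static void *demo_get_argbuf(struct demo_ctx *ctx, size_t bytes)
{
    void *fresh;

    if (bytes == 0 || bytes > DEMO_ARG_LIMIT)
        return NULL;
    if (ctx->arg_buf != NULL && ctx->arg_size >= bytes)
        return ctx->arg_buf;        /* amortized path: no allocation */

    fresh = malloc(bytes);          /* would be kmalloc() in the kernel */
    if (fresh == NULL)
        return NULL;                /* the old buffer, if any, is kept */
    free(ctx->arg_buf);             /* drop the buffer that was too small */
    ctx->arg_buf = fresh;
    ctx->arg_size = bytes;
    ctx->allocs++;
    return fresh;
}

/* One call: copy in, do the work, copy out; the buffer is NOT freed. */
int demo_call(struct demo_ctx *ctx, void *user_vec, size_t bytes)
{
    void *buf;

    assert(ctx != NULL && user_vec != NULL);
    buf = demo_get_argbuf(ctx, bytes);
    if (buf == NULL)
        return -1;
    memcpy(buf, user_vec, bytes);   /* copy_from_user() */
    /* ... register access under lock with interrupts masked ... */
    memcpy(user_vec, buf, bytes);   /* copy_to_user() */
    return 0;
}

/* The buffer is only released when the context goes away. */
void demo_ctx_close(struct demo_ctx *ctx)
{
    free(ctx->arg_buf);
    ctx->arg_buf = NULL;
    ctx->arg_size = 0;
}
```

The sketch also makes the abuse scenario easy to quantify: with a 4096-byte cap and 1024 contexts that have each made one maximum-size call, 1024 x 4096 bytes = 4 MB stay allocated until the contexts are closed.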

Does anyone have something else to propose? If not, what is your opinion of the two approaches above?

Thanks.

--
-Stephane
