Hello,

The current perfmon2 API allows applications to pass vectors of arguments to
certain calls, in particular to the 3 functions to read/write PMU registers. 
This approach was chosen because it is very flexible and allows applications
to modify either multiple or a single register in one call. It is extensible
because there is no implicit knowledge of the actual number of registers.

Before entering the actual system call, the argument vector must be copied
into a kernel buffer. This is required by convention for security and also
fault reasons. The famous copy_from_user() and copy_to_user() are invoked.
This must be done before interrupts are masked.

Vectors can have different sizes based on the PMU model, the number of event
sets. Yet, the vector must be copied into a kernel-level buffer. Today, we
allocate the kernel-memory on demand based on the size of the vector. We use
kmalloc/kfree. Of course, to avoid any abuse, we limit the size of the
allocated region via a perfmon2 tunable in sysfs. By default, it is set
to a page.

This implementation has worked fairly well, yet it costs some performance
because kmalloc/kfree are expensive (especially kfree). Also it may seem
overkill to malloc a page for small vectors.

I have run some experiments lately and they verified that kmalloc/kfree and
copy to/from user account for a very large portion of the cost for calls
with multiple registers (I tried 4). For the copies it is hard to avoid
them. One thing we could do is to try and reduce the size of the structs.
Today, both pfarg_pmd_t and pfarg_pmc_t do have reserved field for future
extensions. It may be possible to reduce those a little bit.

There are several ways to amortize or eliminate the kmalloc/kfree. First of
all, it is important to understand that multiple threads may call into a 
particular context at any time. All they need is access to the file descriptor.

An alternative that I have explored is to start from the hypothesis that
most vectors are small. If they are small enough, we could avoid the
kmalloc/kfree by using a buffer allocated on the stack. One could say
if the vector is less than 8 elements, then use the stack buffer. If not, then
go down the expensive path of kmalloc/kfree. I tried this experiment and got
over 20% improvement for pfm_read_pmds(). I chose 8 as the threshold. The
downside of this approach is that kernel stack space is limited and we should
avoid allocating large buffers on it. The pfarg_pmd struct is about 176 bytes
whereas pfarg_pmc_t is about 48 bytes. With 8 elements we reach 1408 bytes and
this is true for all architectures including i386. I don't know the kernel stack
size on i386 but I suspect it is a page (4kB). Of course, the stack buffer could
be adjusted per object type and per-architecture.

It is important to note that we cannot use a kernel buffer of one element and 
simply
loop over the vector. Because the copy_from/copy_to must be done without locks 
nor
interrupts mask. So one  would have to copy, mask irq, perfmon call, unmask irq,
copy and loop for the next element.

Another approach that was suggested to me is to allocate on demand but not kfree
systematically when the call terminates. In other words, we amortize the cost
of the allocation by keeping the buffer around for the next caller. To make
this work, we would have to decompose the spin_lock_irq*() into spin_*lock()
and local_irq_*able() to avoid the race condition. For the first caller the
buffer would be allocated to fit the size (up to a certain limit like today).
When the call terminates, the buffer is kept via a pointer in the perfmon
context. The next caller, would check the pointer and size, if the buffer
is big enough, copy_user could proceed directly, otherwise a new buffer would
be allocated. that would also work. Yet I can see one issue with this approach
as some malicious user could create lots of contexts and make one call for
each to max out the argument vector limit for each. If you have 1024 descriptors
and the limit is 1 page/context, it could allocate 1024 kernel pages 
(non-pageable)
for nothing. Today we do not have a global argument vector size limit. Adding 
one
would be costly because multiple threads could potentially contend for it and
therefore we would need yet another lock.

I do not see another approach at this point.

Does someone has something else to propose?

If not, what is your opinion of the two approaches above?

Thanks.
-- 

-Stephane
_______________________________________________
perfmon mailing list
[email protected]
http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/

Reply via email to