Some time ago we discussed the possibility of removing the nes_ud_sksq usage
from the IMA driver, as it blocks pushing the IMA solution to kernel.org.
The proposal was to use the OFED transmit-optimized path through
/dev/infiniband/rdma_cm instead of the private nes_ud_sksq device.
I implemented such a solution to check the performance impact and to look for
ways to optimize the existing code.
I ran a simple send test (sendto in kernel) on my Nehalem i7 machine.
The current nes_ud_sksq implementation achieved about 1.25 million pkts/sec.
The OFED path (with the rdma_cm call) achieved about 0.9 million pkts/sec.
I ran oprofile on the rdma_cm code and got the following results:
samples   %        linenr info                app name             symbol name
2586067   24.5323  nes_uverbs.c:558           libnes-rdmav2.so     nes_upoll_cq
1198042   11.3650  (no location information)  vmlinux              __up_read
539258     5.1156  (no location information)  vmlinux              copy_user_generic_string
407884     3.8693  msa_verbs.c:1692           libmsa.so.1.0.0      msa_post_send
304569     2.8892  msa_verbs.c:2098           libmsa.so.1.0.0      usq_sendmsg_noblock
299954     2.8455  (no location information)  vmlinux              __kmalloc
297463     2.8218  (no location information)  libibverbs.so.1.0.0  /usr/lib64/libibverbs.so.1.0.0
267951     2.5419  uverbs_cmd.c:1433          ib_uverbs.ko         ib_uverbs_post_send
264709     2.5111  (no location information)  vmlinux              kfree
205107     1.9457  port.c:2947                libmsa.so.1.0.0      sendto
146225     1.3871  (no location information)  vmlinux              __down_read
145941     1.3844  (no location information)  libpthread-2.5.so    __write_nocancel
139934     1.3275  nes_ud.c:1746              iw_nes.ko            nes_ud_post_send_new_path
131879     1.2510  send.c:32                  msa_tst              blocking_test_send(void*)
127519     1.2097  (no location information)  vmlinux              system_call
123552     1.1721  port.c:858                 libmsa.so.1.0.0      find_mcast
109249     1.0364  nes_verbs.c:3478           iw_nes.ko            nes_post_send
92060      0.8733  (no location information)  vmlinux              vfs_write
90187      0.8555  uverbs_cmd.c:144           ib_uverbs.ko         __idr_get_uobj
89563      0.8496  nes_uverbs.c:1460          libnes-rdmav2.so     nes_upost_send
From the trace it looks like __up_read() - 11% - wastes the most time. It
is called from idr_read_qp() when put_uobj_read() is called.
The second contributor, at 5%, is copy_from_user():

	if (copy_from_user(&cmd, buf, sizeof cmd))
		return -EFAULT;

It is called twice from ib_uverbs_post_send() for IMA and once in
ib_uverbs_write() per each frame.
The third function with a big impact is __kmalloc/kfree - 5%. They are
called twice for each frame transmitted.
Together that is about a 20% performance loss compared to the nes_ud_sksq
path, which we give up when we use the OFED path.
What I can modify is the kmalloc/kfree part - it is possible to allocate
only at start and use pre-allocated buffers afterwards.
I don't see any way to optimize the idr_read_qp usage or the copy_from_user.
In the current approach we use a shared page and a separate nes_ud_sksq
handle for each created QP, so there is no need for any user-space data copy
or QP lookup.
Do you have any idea how we can optimize this path?
Regards,
Mirek
-----Original Message-----
From: Or Gerlitz [mailto:[email protected]]
Sent: Thursday, November 25, 2010 4:01 PM
To: Walukiewicz, Miroslaw
Cc: Jason Gunthorpe; Roland Dreier; Roland Dreier; Hefty, Sean;
[email protected]
Subject: Re: ibv_post_send/recv kernel path optimizations (was: uverbs: handle
large number of entries)
Jason Gunthorpe wrote:
> Hmm, considering your list is everything but Mellanox, maybe it makes much
> more sense to push the copy_to_user down into the driver - ie a
> ibv_poll_cq_user - then the driver can construct each CQ entry on the stack
> and copy it to userspace, avoid the double copy, allocation and avoid any
> fixed overhead of ibv_poll_cq.
>
> A bigger change to be sure, but remember this old thread:
> http://www.mail-archive.com/[email protected]/msg05114.html
> 2x improvement by removing allocs on the post path..
Hi Mirek,
Any updates on your findings with the patches?
Or.