> -----Original Message-----
> From: J. Bruce Fields [mailto:[email protected]]
> Sent: Wednesday, April 24, 2013 18:27
> To: Yan Burman
> Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; [email protected];
> [email protected]; Or Gerlitz
> Subject: Re: NFS over RDMA benchmark
>
> On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
> > On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: J. Bruce Fields [mailto:[email protected]]
> > > > Sent: Wednesday, April 24, 2013 00:06
> > > > To: Yan Burman
> > > > Cc: Wendy Cheng; Atchley, Scott; Tom Tucker;
> > > > [email protected]; [email protected]; Or Gerlitz
> > > > Subject: Re: NFS over RDMA benchmark
> > > >
> > > > On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Wendy Cheng [mailto:[email protected]]
> > > > > > Sent: Wednesday, April 17, 2013 21:06
> > > > > > To: Atchley, Scott
> > > > > > Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
> > > > > > [email protected]; [email protected]
> > > > > > Subject: Re: NFS over RDMA benchmark
> > > > > >
> > > > > > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <[email protected]> wrote:
> > > > > > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <[email protected]> wrote:
> > > > > > >
> > > > > > >> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <[email protected]> wrote:
> > > > > > >>> Hi.
> > > > > > >>>
> > > > > > >>> I've been trying to do some benchmarks for NFS over RDMA and I
> > > > > > >>> seem to only get about half of the bandwidth that the HW can
> > > > > > >>> give me.
> > > > > > >>> My setup consists of 2 servers, each with 16 cores, 32GB of
> > > > > > >>> memory, and a Mellanox ConnectX3 QDR card over PCIe gen3.
> > > > > > >>> These servers are connected to a QDR IB switch. The backing
> > > > > > >>> storage on the server is tmpfs mounted with noatime.
> > > > > > >>> I am running kernel 3.5.7.
> > > > > > >>>
> > > > > > >>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes
> > > > > > >>> 4-512K.
> > > > > > >>> When I run fio over RDMA-mounted NFS, I get 260-2200 MB/sec for
> > > > > > >>> the same block sizes (4-512K). Running over IPoIB-CM, I get
> > > > > > >>> 200-980 MB/sec.
> > > > > > >
> > > > > > > Yan,
> > > > > > >
> > > > > > > Are you trying to optimize single client performance or
> > > > > > > server performance with multiple clients?
> > > > > > >
> > > > >
> > > > > I am trying to get maximum performance from a single server - I used
> > > > > 2 processes in the fio test - more than 2 did not show any
> > > > > performance boost.
> > > > > I tried running fio from 2 different PCs on 2 different files, but
> > > > > the sum of the two is more or less the same as running from a single
> > > > > client PC.
> > > > >
> > > > > What I did see is that the server is sweating a lot more than the
> > > > > clients and, more than that, it has 1 core (CPU5) at 100% in a
> > > > > softirq tasklet:
> > > > > cat /proc/softirqs
> > > >
> > > > Would any profiling help figure out which code it's spending time in?
> > > > (E.g. something as simple as "perf top" might have useful output.)
> > > >
> > >
> > >
> > > Perf top for the CPU with the high tasklet count gives:
> > >
> > >  samples  pcnt   RIP               function             DSO
> > >  _______  _____   ________________  ___________________  _____________
> > >
> > >  2787.00  24.1%   ffffffff81062a00  mutex_spin_on_owner  /root/vmlinux
> >
> > I guess that means lots of contention on some mutex? If only we knew
> > which one... perf should also be able to collect stack statistics; I
> > forget how.
>
> Googling around.... I think we want:
>
> perf record -a --call-graph
> (give it a chance to collect some samples, then ^C)
> perf report --call-graph --stdio
>
Sorry it took me a while to get perf to show the call trace (I had not
enabled frame pointers in the kernel and struggled with the perf
options...).
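In case it saves someone else the trouble, roughly what I ended up
doing was (a sketch from memory; exact options may differ between perf
versions, and CONFIG_FRAME_POINTER=y is needed for usable stacks):

    # rebuild the kernel with CONFIG_FRAME_POINTER=y, then:
    perf record -a -g          # system-wide, with call graphs
    # ... let the fio test run for a while, then ^C ...
    perf report --stdio -g     # text-mode report with call chains

With that in place, what I get is: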
36.18% nfsd [kernel.kallsyms] [k] mutex_spin_on_owner
|
--- mutex_spin_on_owner
|
|--99.99%-- __mutex_lock_slowpath
| mutex_lock
| |
| |--85.30%-- generic_file_aio_write
| | do_sync_readv_writev
| | do_readv_writev
| | vfs_writev
| | nfsd_vfs_write
| | nfsd_write
| | nfsd3_proc_write
| | nfsd_dispatch
| | svc_process_common
| | svc_process
| | nfsd
| | kthread
| | kernel_thread_helper
| |
| --14.70%-- svc_send
| svc_process
| nfsd
| kthread
| kernel_thread_helper
--0.01%-- [...]
9.63% nfsd [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|
--- _raw_spin_lock_irqsave
|
|--43.97%-- alloc_iova
| intel_alloc_iova
| __intel_map_single
| intel_map_page
| |
| |--60.47%-- svc_rdma_sendto
| | svc_send
| | svc_process
| | nfsd
| | kthread
| | kernel_thread_helper
| |
| |--30.10%-- rdma_read_xdr
| | svc_rdma_recvfrom
| | svc_recv
| | nfsd
| | kthread
| | kernel_thread_helper
| |
| |--6.69%-- svc_rdma_post_recv
| | send_reply
| | svc_rdma_sendto
| | svc_send
| | svc_process
| | nfsd
| | kthread
| | kernel_thread_helper
| |
| --2.74%-- send_reply
| svc_rdma_sendto
| svc_send
| svc_process
| nfsd
| kthread
| kernel_thread_helper
|
|--37.52%-- __free_iova
| flush_unmaps
| add_unmap
| intel_unmap_page
| |
| |--97.18%-- svc_rdma_put_frmr
| | sq_cq_reap
| | dto_tasklet_func
| | tasklet_action
| | __do_softirq
| | call_softirq
| | do_softirq
| | |
| | |--97.40%-- irq_exit
| | | |
| | | |--99.85%-- do_IRQ
| | | | ret_from_intr
| | | | |
| | | | |--40.74%-- generic_file_buffered_write
| | | | |           __generic_file_aio_write
| | | | |           generic_file_aio_write
| | | | |           do_sync_readv_writev
| | | | |           do_readv_writev
| | | | |           vfs_writev
| | | | |           nfsd_vfs_write
| | | | |           nfsd_write
| | | | |           nfsd3_proc_write
| | | | |           nfsd_dispatch
| | | | |           svc_process_common
| | | | |           svc_process
| | | | |           nfsd
| | | | |           kthread
| | | | |           kernel_thread_helper
| | | | |
| | | | |--25.21%-- __mutex_lock_slowpath
| | | | |           mutex_lock
| | | | |           |
| | | | |           |--94.84%-- generic_file_aio_write
| | | | |           |           do_sync_readv_writev
| | | | |           |           do_readv_writev
| | | | |           |           vfs_writev
| | | | |           |           nfsd_vfs_write
| | | | |           |           nfsd_write
| | | | |           |           nfsd3_proc_write
| | | | |           |           nfsd_dispatch
| | | | |           |           svc_process_common
| | | | |           |           svc_process
| | | | |           |           nfsd
| | | | |           |           kthread
| | | | |           |           kernel_thread_helper
| | | | |           |
The entire trace is almost 1MB, so send me an off-list message if you want it.
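For what it's worth, the trace reads to me like all the nfsd threads
writing to the same file are serializing on the inode mutex that
generic_file_aio_write() takes. A minimal sketch of that pattern (not
the exact 3.5.7 source, just the shape of the serialization point):

    /* Sketch: every buffered writer to a given file funnels through
     * the per-inode mutex here, so N nfsd threads writing the same
     * file make progress one at a time. */
    ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
                                   unsigned long nr_segs, loff_t pos)
    {
            struct inode *inode = iocb->ki_filp->f_mapping->host;
            ssize_t ret;

            mutex_lock(&inode->i_mutex);    /* the contended lock in the trace */
            ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
            mutex_unlock(&inode->i_mutex);

            return ret;
    }

If that is right, spreading the fio jobs across more files (one file
per job) should reduce the mutex contention. The second hotspot
(_raw_spin_lock_irqsave under alloc_iova/__free_iova) looks like the
Intel IOMMU's DMA-mapping lock; booting the server with intel_iommu=off
or iommu=pt might show how much of the remaining time that costs.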
Yan