Hello Aaron,

Yes, we recently saw this error:

VERBS RDMA rdma send error IBV_WC_RETRY_EXC_ERR to 111.11.11.11 (sidra.nnode_group2.gpfs) on mlx5_0 port 2 fabnum 0 vendor_err 129

And
Tushar B Pathare
MBA IT, BE IT
Bigdata & GPFS
Software Development & Databases
Scientific Computing
Bioinformatics Division
Research
"What ever the mind of man can conceive and believe, drill can query"
Sidra Medical and Research Centre
Sidra OPC Building
Sidra Medical & Research Center
PO Box 26999
Al Luqta Street
Education City North Campus
Qatar Foundation, Doha, Qatar
Office 4003 3333 ext 37443 | M +974 74793547
tpath...@sidra.org | http://www.sidra.org/

From: <gpfsug-discuss-boun...@spectrumscale.org> on behalf of "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" <aaron.s.knis...@nasa.gov>
Reply-To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Date: Sunday, May 21, 2017 at 11:59 AM
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Subject: Re: [gpfsug-discuss] VERBS RDMA issue

Hi Tushar,

For me the issue was an underlying performance bottleneck (some CPU frequency scaling problems causing cores to throttle back when it wasn't appropriate).

I noticed you have verbsRdmaSend set to yes. I've seen suggestions in the past to turn this off under certain conditions, although I don't remember what those were. Hopefully others can chime in and qualify that.

Are you seeing any RDMA errors in your logs (e.g. grep IBV_ out of the mmfs.log)?

-Aaron

On May 21, 2017 at 04:41:00 EDT, Tushar Pathare <tpath...@sidra.org> wrote:

Hello Team,

We are seeing a lot of waiters with the message "waiting for conn rdmas < conn maxrdmas". Are there any recommended settings to resolve this issue?

Our config for RDMA is as follows for 140 nodes (32 cores each):

VERBS RDMA Configuration:
Status : started
Start time : Thu
Stats reset time : Thu
Dump time : Sun
mmfs verbsRdma : enable
mmfs verbsRdmaCm : disable
mmfs verbsPorts : mlx4_0/1 mlx4_0/2
mmfs verbsRdmasPerNode : 3200
mmfs verbsRdmasPerNode (max) : 3200
mmfs verbsRdmasPerNodeOptimize : yes
mmfs verbsRdmasPerConnection : 16
mmfs verbsRdmasPerConnection (max) : 16
mmfs verbsRdmaMinBytes : 16384
mmfs verbsRdmaRoCEToS : -1
mmfs verbsRdmaQpRtrMinRnrTimer : 18
mmfs verbsRdmaQpRtrPathMtu : 2048
mmfs verbsRdmaQpRtrSl : 0
mmfs verbsRdmaQpRtrSlDynamic : no
mmfs verbsRdmaQpRtrSlDynamicTimeout : 10
mmfs verbsRdmaQpRtsRnrRetry : 6
mmfs verbsRdmaQpRtsRetryCnt : 6
mmfs verbsRdmaQpRtsTimeout : 18
mmfs verbsRdmaMaxSendBytes : 16777216
mmfs verbsRdmaMaxSendSge : 27
mmfs verbsRdmaSend : yes
mmfs verbsRdmaSerializeRecv : no
mmfs verbsRdmaSerializeSend : no
mmfs verbsRdmaUseMultiCqThreads : yes
mmfs verbsSendBufferMemoryMB : 1024
mmfs verbsLibName : libibverbs.so
mmfs verbsRdmaCmLibName : librdmacm.so
mmfs verbsRdmaMaxReconnectInterval : 60
mmfs verbsRdmaMaxReconnectRetries : -1
mmfs verbsRdmaReconnectAction : disable
mmfs verbsRdmaReconnectThreads : 32
mmfs verbsHungRdmaTimeout : 90
ibv_fork_support : true
Max connections : 196608
Max RDMA size : 16777216
Target number of vsend buffs : 16384
Initial vsend buffs per conn : 59
nQPs : 140
nCQs : 282
nCMIDs : 0
nDtoThreads : 2
nextIndex : 141
Number of Devices opened : 1
Device : mlx4_0
vendor_id : 713
Device vendor_part_id : 4099
Device mem register chunk : 8589934592 (0x200000000)
Device max_sge : 32
Adjusted max_sge : 0
Adjusted max_sge vsend : 30
Device max_qp_wr : 16351
Device max_qp_rd_atom : 16
Open Connect Ports : 1
verbsConnectPorts[0] : mlx4_0/1/0
lid : 129
state : IBV_PORT_ACTIVE
path_mtu : 2048
interface ID : 0xe41d2d030073b9d1
sendChannel.ib_channel : 0x7FA6CB816200
sendChannel.dtoThreadP : 0x7FA6CB821870
sendChannel.dtoThreadId : 12540
sendChannel.nFreeCq : 1
recvChannel.ib_channel : 0x7FA6CB81D590
recvChannel.dtoThreadP : 0x7FA6CB822BA0
recvChannel.dtoThreadId : 12541
recvChannel.nFreeCq : 1
ibv_cq : 0x7FA2724C81F8
ibv_cq.cqP : 0x0
ibv_cq.nEvents : 0
ibv_cq.contextP : 0x0
ibv_cq.ib_channel : 0x0

Thanks
Tushar B Pathare

Disclaimer: This email and its attachments may be confidential and are intended solely for the use of the individual to whom it is addressed. If you are not the intended recipient, any reading, printing, storage, disclosure, copying or any other action taken in respect of this e-mail is prohibited and may be unlawful. If you are not the intended recipient, please notify the sender immediately by using the reply function and then permanently delete what you have received. Any views or opinions expressed are solely those of the author and do not necessarily represent those of Sidra Medical and Research Center.
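
Aaron's suggestion to grep IBV_ out of mmfs.log can be taken one step further by tallying the matches per peer and port, which makes it easier to see whether the retry errors cluster on one node or one HCA port. The following is only a rough sketch in Python: the regular expression is modelled on the single error line quoted in this thread, so the exact log layout is an assumption, and `summarize_ibv_errors` is a hypothetical helper name, not part of GPFS.

```python
import re
from collections import Counter

# Pattern modelled on the one error line quoted in this thread. The exact
# mmfs.log layout is an assumption and may differ between releases.
IBV_LINE = re.compile(
    r"(?P<status>IBV_WC_\w+)\s+to\s+(?P<peer>\S+)\s+\((?P<node>[^)]+)\)"
    r"\s+on\s+(?P<device>\S+)\s+port\s+(?P<port>\d+)"
    r".*?vendor_err\s+(?P<vendor_err>\d+)"
)


def summarize_ibv_errors(lines):
    """Count IBV_WC_* errors per (status, node, device, port)."""
    counts = Counter()
    for line in lines:
        match = IBV_LINE.search(line)
        if match:
            key = (
                match["status"],
                match["node"],
                match["device"],
                int(match["port"]),
            )
            counts[key] += 1
    return counts


# Sample input: the error line from this thread plus one unrelated line.
sample = [
    "VERBS RDMA rdma send error IBV_WC_RETRY_EXC_ERR to 111.11.11.11 "
    "(sidra.nnode_group2.gpfs) on mlx5_0 port 2 fabnum 0 vendor_err 129",
    "some unrelated log line",
]

if __name__ == "__main__":
    for key, count in summarize_ibv_errors(sample).most_common():
        print(count, *key)
```

To run it against a real log, pass an open file object instead of `sample`, for example `summarize_ibv_errors(open(path))`, where `path` points at your own mmfs.log.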
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss