Flow-control may well just mask the real problem. Did your throughput improve? Also, does that mean flow-control is now on for all ports on the switch? IIUC, such "global pause" flow-control means switch ports with links to upstream network devices will also be paused whenever the switch is trying to pass packets from those ports down to a congested host.
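For what it's worth, the host-side pause settings and counters can be inspected with ethtool; a minimal sketch, with eth0 standing in for the real interface (counter names vary by driver, e.g. rx_pause / tx_pause on mlx5):

    # Show whether RX/TX pause (link-level flow control) is enabled/negotiated
    ethtool -a eth0

    # Enable pause frames in both directions on the NIC
    ethtool -A eth0 rx on tx on

    # Check the driver's pause counters to see whether pauses actually flow
    ethtool -S eth0 | grep -i pause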
Are your ring buffers tuned up as high as possible, `ethtool -g <ifname>`? (See the sketch after the quoted message below.)

On 11 September 2017 at 23:09, Andreas Herrmann <[email protected]> wrote:
> Hi,
>
> flow control was active on the NIC but not on the switch.
>
> Enabling flow control for both directions solved the problem:
>
>     flowcontrol receive on
>     flowcontrol send on
>
> Port       Send FlowControl   Receive FlowControl   RxPause   TxPause
>            admin    oper      admin    oper
> ---------- -------- -------- -------- --------  -------- ---------
> Et17/1     on       on       on       on               0     64500
> Et17/2     on       on       on       on               0     33746
> Et17/3     on       on       on       on               0     17126
> Et18/1     on       on       on       on               0     36948
> Et18/2     on       on       on       on               0     39628
>
> Regards,
> Andreas
>
> On 08.09.2017 13:57, Andreas Herrmann wrote:
> > Hello,
> >
> > I have a fresh Proxmox installation on 5 servers (Supermicro X10SRW-F,
> > Xeon E5-1660 v4, 128 GB RAM), each with 8 Samsung SSD SM863 960GB
> > connected to an LSI-9300-8i (SAS3008) controller and used as OSDs for
> > Ceph (12.1.2).
> >
> > The servers are connected to two Arista DCS-7060CX-32S switches. I'm
> > using an MLAG bond (bond mode LACP, xmit_hash_policy layer3+4, MTU 9000):
> > * backend network for Ceph: cluster network & public network,
> >   Mellanox ConnectX-4 Lx dual-port 25 GBit/s
> > * frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
> >   dual-port
> >
> > Ceph is quite a default installation with size=3.
> >
> > My problem:
> > I'm issuing a dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in
> > a test virtual machine (the only one running in the cluster), which
> > writes at around 210 MB/s. I get output drops on all switch ports. The
> > drop rate is between 0.1 and 0.9 %. The drop rate of 0.9 % is reached
> > when writing into Ceph at about 1300 MB/s.
> >
> > First I thought about a problem with the Mellanox cards and used the
> > Intel cards for Ceph traffic, but the problem persisted.
> >
> > I tried quite a lot and nothing helped:
> > * changed the MTU from 9000 to 1500
> > * changed bond_xmit_hash_policy from layer3+4 to layer2+3
> > * deactivated the bond and just used a single link
> > * disabled offloading
> > * disabled power management in BIOS
> > * perf-bias 0
> >
> > I analyzed the traffic via tcpdump and got some of those "errors":
> > * TCP Previous segment not captured
> > * TCP Out-of-Order
> > * TCP Retransmission
> > * TCP Fast Retransmission
> > * TCP Dup ACK
> > * TCP ACKed unseen segment
> > * TCP Window Update
> >
> > Is that behaviour normal for Ceph, or does anyone have ideas how to
> > solve the problem with the output drops at switch side?
> >
> > With iperf I can reach the full 50 GBit/s on the bond with zero output
> > drops.
> >
> > Andreas

--
Cheers,
~Blairo
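For the ring-buffer check suggested at the top of this reply, a minimal ethtool sketch, again assuming eth0 as a stand-in for the Ceph-facing interface:

    # Show the current and hardware-maximum RX/TX ring sizes
    ethtool -g eth0

    # Raise the rings to the values reported under "Pre-set maximums",
    # e.g. if the NIC reports 8192 for both (the limit is hardware-dependent)
    ethtool -G eth0 rx 8192 tx 8192

    # Note: this does not persist across reboots; re-apply it from your
    # network configuration (e.g. an ifupdown post-up hook)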
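On the switch side, the change Andreas quotes corresponds to per-interface Arista EOS configuration along these lines; a sketch only, with Ethernet17/1 as the example port (the table in his message looks like `show interfaces flowcontrol` output):

    switch(config)# interface Ethernet17/1
    switch(config-if-Et17/1)# flowcontrol receive on
    switch(config-if-Et17/1)# flowcontrol send on
    switch(config-if-Et17/1)# end
    switch# show interfaces flowcontrol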
