Flow-control may well just mask the real problem. Did your throughput
improve? And does that mean flow-control is now enabled on all ports on the
switch? IIUC, such "global pause" flow-control means that switchports with
links to upstream network devices will also be paused if the switch is
trying to pass packets from those ports down to a congested host.
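
On the host side you can at least confirm what pause behaviour each NIC has
actually negotiated, for example (the interface name is just a placeholder):

        # show negotiated pause (flow-control) settings for the NIC
        ethtool -a enp65s0f0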

Are your ring buffers tuned up as high as possible (`ethtool -g <ifname>`)?
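
For example (interface name and ring sizes are placeholders; `ethtool -g`
reports the actual maximums your driver supports):

        # show current and maximum ring sizes (enp65s0f0 is a placeholder)
        ethtool -g enp65s0f0
        # raise RX/TX rings towards the reported maximums
        ethtool -G enp65s0f0 rx 8192 tx 8192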

On 11 September 2017 at 23:09, Andreas Herrmann <[email protected]> wrote:

> Hi,
>
> flow control was active on the NIC but not on the switch.
>
> Enabling flow control in both directions solved the problem:
>         flowcontrol receive on
>         flowcontrol send on
>
> Port        Send FlowControl  Receive FlowControl  RxPause       TxPause
>             admin    oper     admin    oper
> ----------  -------- -------- -------- --------    ------------- -------------
> Et17/1      on       on       on       on          0             64500
> Et17/2      on       on       on       on          0             33746
> Et17/3      on       on       on       on          0             17126
> Et18/1      on       on       on       on          0             36948
> Et18/2      on       on       on       on          0             39628
>
> Regards,
> Andreas
>
>
> On 08.09.2017 13:57, Andreas Herrmann wrote:
> > Hello,
> >
> > I have a fresh Proxmox installation on 5 servers (Supermicro X10SRW-F, Xeon
> > E5-1660 v4, 128 GB RAM), each with 8 Samsung SSD SM863 960GB connected to a
> > LSI-9300-8i (SAS3008) controller, used as OSDs for Ceph (12.1.2).
> >
> > The servers are connected to two Arista DCS-7060CX-32S switches, using MLAG
> > bonds (bond mode LACP, xmit_hash_policy layer3+4, MTU 9000; a rough sketch
> > of the bond config follows the list):
> >  * backend network for Ceph: cluster network & public network
> >    Mellanox ConnectX-4 Lx dual-port 25 GBit/s
> >  * frontend network: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ dual-port
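> >
> > Roughly, the bond config in /etc/network/interfaces looks like this
> > (interface names are placeholders):
> >
> >         auto bond0
> >         iface bond0 inet manual
> >                 # enp65s0f0/enp65s0f1 are placeholder NIC names
> >                 bond-slaves enp65s0f0 enp65s0f1
> >                 bond-mode 802.3ad
> >                 bond-xmit-hash-policy layer3+4
> >                 bond-miimon 100
> >                 mtu 9000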
> >
> > Ceph is quite a default installation with size=3.
> >
> > My problem:
> > I'm issuing a dd (dd if=/dev/urandom of=urandom.0 bs=10M count=1024) in a
> > test virtual machine (the only one running in the cluster), writing at
> > around 210 MB/s, and I get output drops on all switchports. The drop rate
> > is between 0.1 and 0.9 %; 0.9 % is reached when writing into Ceph at about
> > 1300 MB/s.
> >
> > First I suspected a problem with the Mellanox cards and switched to the
> > Intel cards for Ceph traffic, but the problem persisted.
> >
> > I tried quite a lot, and nothing helped:
> >  * changed the MTU from 9000 to 1500
> >  * changed bond_xmit_hash_policy from layer3+4 to layer2+3
> >  * deactivated the bond and just used a single link
> >  * disabled offloading
> >  * disabled power management in BIOS
> >  * perf-bias 0
> >
> > I analyzed the traffic with tcpdump and saw some of these "errors":
> >  * TCP Previous segment not captured
> >  * TCP Out-of-Order
> >  * TCP Retransmission
> >  * TCP Fast Retransmission
> >  * TCP Dup ACK
> >  * TCP ACKed unseen segment
> >  * TCP Window Update
> >
> > Is that behaviour normal for Ceph, or does anyone have ideas how to solve
> > the problem with the output drops on the switch side?
> >
> > With iperf I can reach full 50 GBit/s on the bond with zero output drops.
> >
> > Andreas



-- 
Cheers,
~Blairo
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
