On 2017-06-29 16:30, Nick Fisk wrote:

> Hi All,
> 
> Putting out a call for help to see if anyone can shed some light on this.
> 
> Configuration:
> Ceph cluster presenting RBDs->XFS->NFS->ESXi
> Running 10.2.7 on the OSDs and a 4.11 kernel on the NFS gateways in a
> pacemaker cluster
> Both OSDs and clients connect to a pair of switches in a single L2
> domain (no sign from pacemaker of any network connectivity issues)
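
If you can catch one of these hangs in the act, the libceph debugfs
entries on the gateway show what the kernel client is actually waiting
on. Something along these lines (paths from memory; the directory name
is the cluster fsid plus client id, and debugfs must be mounted):

    # in-flight requests from this client to the OSDs
    cat /sys/kernel/debug/ceph/*/osdc
    # monitor session state and the osdmap epoch the client has seen
    cat /sys/kernel/debug/ceph/*/monc
    cat /sys/kernel/debug/ceph/*/osdmap

If osdc shows requests pinned against one OSD while the cluster reports
healthy, that points at the client session rather than the cluster.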
> 
> Symptoms:
> - All RBDs on a single client randomly hang for 30s to several
> minutes, confirmed by pacemaker and by the ESXi hosts complaining
> - Cluster load is usually minimal when this happens
> - All other clients with RBDs are unaffected (same RADOS pool), so it
> seems more of a client issue than a cluster issue
> - It looks like pacemaker then tries to stop the RBD+FS resources,
> but this also hangs
> - Eventually pacemaker succeeds in stopping the resources and
> immediately restarts them, and IO returns to normal
> - No errors, slow requests, or any other abnormal Ceph status is
> reported on the cluster or in ceph.log
> - Client logs show nothing apart from the pacemaker messages
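
Since even the pacemaker stop action hangs, a blocked-task dump taken
during an event would show where things are actually stuck (nfsd, XFS,
rbd, ...). A rough sketch, assuming sysrq is available on the gateway:

    # enable sysrq, then dump all D-state (uninterruptible) tasks to dmesg
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200

If nfsd threads are sitting in D state underneath XFS/rbd, that narrows
it down considerably.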
> 
> Things I've tried:
> - Different kernels (potentially happened less with older kernels, but can't
> be 100% sure)
> - Disabling scrubbing and anything else that could be causing high load
> - Enabling kernel RBD debugging (the problem only happens maybe a
> couple of times a day, so leaving debug logging on was not practical,
> and I can't reproduce it on demand)
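
On the debug-logging point: dynamic debug can be toggled at runtime, so
a watchdog script could switch it on only while a hang is in progress
rather than leaving it on all day. Roughly (assuming the kernel was
built with CONFIG_DYNAMIC_DEBUG):

    # enable pr_debug output for the rbd and libceph modules
    echo 'module rbd +p'     > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
    # ... wait for / reproduce the hang ...
    echo 'module rbd -p'     > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control

That keeps the log volume manageable compared with full-time debugging.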
> 
> Anyone have any ideas?
> 
> Thanks,
> Nick
> 

My suggestion is to do a test with pacemaker out of the picture and run
the NFS gateway(s) without HA; that may give a clue as to whether it is
an ESXi->NFS issue or a pacemaker issue.
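
A minimal sketch of that test, using pacemaker's maintenance mode so
nothing has to be torn down (the image name and export path below are
illustrative, adjust to your setup):

    # pacemaker stops managing resources but leaves them running
    pcs property set maintenance-mode=true

    # map and export by hand on one gateway
    rbd map rbd/myimage
    mount /dev/rbd0 /mnt/export
    exportfs -o rw,sync,no_root_squash '*:/mnt/export'

If the hangs still occur with pacemaker idle, look at the
ESXi->NFS->RBD path; if they stop, look at the resource agents and
their monitor/stop timeouts.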
