On 2017-06-29 16:30, Nick Fisk wrote:
> Hi All,
>
> Putting out a call for help to see if anyone can shed some light on this.
>
> Configuration:
> Ceph cluster presenting RBDs -> XFS -> NFS -> ESXi
> Running 10.2.7 on the OSDs and a 4.11 kernel on the NFS gateways in a
> pacemaker cluster
> Both OSDs and clients connect into a pair of switches, single L2 domain (no
> sign from pacemaker of any network connectivity issues)
>
> Symptoms:
> - All RBDs on a single client randomly hang for 30s to several minutes,
> confirmed by pacemaker and by ESXi hosts complaining
> - Cluster load is minimal when this happens most times
> - All other clients with RBDs are not affected (same RADOS pool), so it
> seems more of a client issue than a cluster issue
> - It looks like pacemaker tries to also stop RBD+FS resource, but this also
> hangs
> - Eventually pacemaker succeeds in stopping resources and immediately
> restarts them, IO returns to normal
> - No errors, slow requests, or any other abnormal Ceph status is reported
> on the cluster or in ceph.log
> - Client logs show nothing apart from pacemaker
>
> Things I've tried:
> - Different kernels (potentially happened less with older kernels, but can't
> be 100% sure)
> - Disabling scrubbing and anything else that could be causing high load
> - Enabling kernel RBD debugging (the problem happens maybe a couple of times
> a day, so debug logging was not practical as I can't reproduce on demand)
>
> Anyone have any ideas?
>
> Thanks,
> Nick
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
My suggestion is to run a test with pacemaker out of the picture, i.e. run
the NFS gateway(s) without HA. That may give a clue as to whether it is an
ESXi->NFS issue or a pacemaker issue.
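
Something along these lines might work as a starting point. This is only a
sketch: the resource/pool/image names, mount point, and export network are
placeholders, and it assumes the pcs CLI (with crmsh the equivalent would be
"crm configure property maintenance-mode=true"). It also needs a live cluster,
so treat it as illustrative rather than copy-paste:

```shell
# Put the cluster into maintenance mode: pacemaker stops managing the
# resources but leaves them running, so any further hangs can't be a
# pacemaker monitor/stop action.
pcs property set maintenance-mode=true

# Alternatively, stand up one gateway entirely by hand.
# Pool, image, mount point, and export network below are placeholders:
rbd map rbd/esxi-datastore1
mount /dev/rbd/rbd/esxi-datastore1 /mnt/datastore1
exportfs -o rw,sync,no_root_squash 10.0.0.0/24:/mnt/datastore1
```

If the hangs still occur with pacemaker out of the loop, that points at the
ESXi/NFS/kernel-RBD path rather than the HA stack.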