Hi Greg, Zheng,

Is this fixed in a later version of the kernel client? Or would it be wise for
us to start using the FUSE client (ceph-fuse)?

Cheers,

Simon

> -----Original Message-----
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: 31 August 2015 13:02
> To: Yan, Zheng
> Cc: Simon Hallam; Zheng Yan; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Testing CephFS
> 
> On Mon, Aug 31, 2015 at 12:16 PM, Yan, Zheng <uker...@gmail.com> wrote:
> > On Mon, Aug 24, 2015 at 6:38 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> >> On Mon, Aug 24, 2015 at 11:35 AM, Simon Hallam <s...@pml.ac.uk> wrote:
> >>> Hi Greg,
> >>>
> >>> The MDSs detected that the other one went down and started the replay.
> >>>
> >>> I did some further testing with 20 client machines. Of the 20 client machines, 5 hung with the following error:
> >>>
> >>> [Aug24 10:53] ceph: mds0 caps stale
> >>> [Aug24 10:54] ceph: mds0 caps stale
> >>> [Aug24 10:58] ceph: mds0 hung
> >>> [Aug24 11:03] ceph: mds0 came back
> >>> [  +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN)
> >>> [  +0.000018] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon
> >>> [Aug24 11:04] ceph: mds0 reconnect start
> >>> [  +0.084938] libceph: mon2 10.15.0.3:6789 session established
> >>> [  +0.008475] ceph: mds0 reconnect denied
> >>
> >> Oh, this might be a kernel bug, failing to ask for mdsmap updates when
> >> the connection goes away. Zheng, does that sound familiar?
> >> -Greg
> >>
> >
> > I reproduced this locally (using SIGSTOP to stop the monitor). I think
> > the root cause is that the kernel client does not implement
> > CEPH_FEATURE_MSGR_KEEPALIVE2, so it could not reliably detect that the
> > network cable had been unplugged. It kept waiting for new events from
> > the disconnected connection.
> 
> Yeah, the userspace client maintains an ongoing MDSMap subscription
> from the monitors in order to hear about this. It puts more load on
> the monitors but right now that's the solution we're going with: the
> monitor times out the MDS, publishes a series of new maps (pushed to
> the clients) in order to activate a standby, and the clients see that
> they need to connect to the new MDS instance.
> -Greg
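
For anyone following the thread, here is a rough sketch of the idea Zheng describes above: a client that sends keepalive probes and expects an acknowledgement can notice a silent peer, whereas a client that only waits for incoming events cannot tell an idle connection from a dead one. This is not Ceph's messenger code. The class name, the function name, and the 10 s / 30 s values are all made up for illustration, and the "network" is simulated with a simple clock.

```python
from dataclasses import dataclass


@dataclass
class KeepaliveTracker:
    """Hypothetical tracker for acknowledged keepalives on one connection."""
    interval: float          # seconds between probes (illustrative value)
    timeout: float           # seconds without an ack before giving up
    last_ack: float = 0.0    # time the last ack was received
    last_sent: float = 0.0   # time the last probe was sent

    def tick(self, now: float) -> str:
        """Decide what the client should do at time `now`."""
        if now - self.last_ack >= self.timeout:
            return "reconnect"        # peer has been silent for too long
        if now - self.last_sent >= self.interval:
            self.last_sent = now
            return "send_keepalive"   # time for another probe
        return "idle"

    def on_ack(self, now: float) -> None:
        """Record that the peer acknowledged a probe."""
        self.last_ack = now


def simulate(peer_stops_at: float, end: float = 60.0, step: float = 1.0):
    """Return the time at which the client gives up on the peer, or None.

    The peer acknowledges every probe sent before `peer_stops_at` and is
    silent afterwards (like a monitor stopped with SIGSTOP).
    """
    tracker = KeepaliveTracker(interval=10.0, timeout=30.0)
    now = 0.0
    while now <= end:
        action = tracker.tick(now)
        if action == "reconnect":
            return now
        if action == "send_keepalive" and now < peer_stops_at:
            tracker.on_ack(now)
        now += step
    return None


if __name__ == "__main__":
    print("peer stops at t=15, client gives up at:", simulate(15.0))
    print("peer stays up, client gives up at:", simulate(1000.0))
```

Without the ack check in `tick`, the client in this sketch would sit in "idle" forever once the peer stops, which resembles the hang in the dmesg output quoted above.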
