On Wed, Nov 2, 2016 at 2:49 PM, Nick Fisk <[email protected]> wrote:
> A bit more digging: the original crash appears to be similar (but not exactly 
> the same) to this tracker report
>
> http://tracker.ceph.com/issues/16983
>
> I can see that this was fixed in 10.2.3, so I will probably look to upgrade.
>
> If the logs make sense to anybody with a bit more knowledge, I would be 
> interested to know whether that bug is related or whether I have stumbled on 
> something new.

Yep, from what's present it definitely looks like that. Good searching. :)
-Greg

>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:[email protected]] On Behalf Of 
>> Nick Fisk
>> Sent: 02 November 2016 17:58
>> To: 'Ceph Users' <[email protected]>
>> Subject: [ceph-users] MDS Problems - Solved but reporting for benefit of 
>> others
>>
>> Hi all,
>>
>> We just had a bit of an outage with CephFS around the MDSs. I managed to get 
>> everything up and running again after a bit of head scratching and thought I 
>> would share here what happened.
>>
>> Cause
>> I believe the MDSs, which were running as VMs, suffered when the hypervisor 
>> ran out of RAM and started swapping during hypervisor maintenance. I know 
>> this is less than ideal and have put steps in place to prevent it happening 
>> again.
>>
>> Symptoms
>> 1. Noticed that both MDSs were down; log files on both showed that they had
>>    crashed.
>> 2. After restarting the MDSs, their status kept flipping between replay and
>>    reconnect (see the status commands below).
>> 3. Now and again both MDSs would crash again.
>> 4. Log files showed they seemed to keep restarting after trying to reconnect
>>    clients.
>> 5. Clients were all kernel clients: one was on 3.19 and the rest on 4.8. I
>>    believe the problematic client was one of the ones running kernel 4.8.
>> 6. Ceph is 10.2.2.
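>>
>> For anyone hitting the same thing, the state flipping was easy to watch from 
>> the monitors with the usual status commands, e.g.:
>>
>> ceph -s
>> ceph mds stat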
>>
>> Resolution
>> After some serious head scratching and a little bit of panicking, the fact 
>> that the log files showed the restart always happened after trying to 
>> reconnect the clients gave me the idea to try and kill the sessions on the 
>> MDS. I first reset all the clients and waited, but this didn't seem to have 
>> any effect and I could still see the MDS trying to reconnect to the clients. 
>> I then decided to try and kill the sessions from the MDS end, so I shut down 
>> the standby MDS (as they kept flipping active roles) and ran
>>
>> ceph daemon mds.gp-ceph-mds1 session ls
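>>
>> The output is a JSON array; if you have jq installed, something like this 
>> should pull out just the session ids and client addresses (field names are 
>> from memory of the 10.2.x output, so double-check against your own):
>>
>> ceph daemon mds.gp-ceph-mds1 session ls | jq '.[] | {id, inst}'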
>>
>> I then tried to kill the last session in the list
>>
>> ceph daemon mds.gp-ceph-mds1 session evict <session id>
>>
>> I had to keep hammering this command to get it in at the right moment, as 
>> the MDS was only responding for a fraction of a second at a time.
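>>
>> In hindsight a quick retry loop would have saved the hammering, something 
>> like the below (untested, and assuming the command exits non-zero while the 
>> admin socket isn't answering; the session id is the one from session ls):
>>
>> until ceph daemon mds.gp-ceph-mds1 session evict <session id>; do
>>     sleep 0.2
>> done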
>>
>> Suddenly, in my other window where I had a tail of the MDS log, I saw a 
>> whizz of new information, which then stopped with the MDS success message. 
>> So it seems something the MDS was trying to do whilst reconnecting was 
>> upsetting it. ceph -s updated to show that the MDS was now active. Rebooting 
>> the other MDS then made it the standby again. Problem solved.
>>
>> I have uploaded the two MDS logs here if any CephFS devs are interested in 
>> taking a closer look.
>>
>> http://app.sys-pro.co.uk/mds_logs.zip
>>
>> Nick
>>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
