I forgot that I left my VM mount command running. It hangs my VM, but more
alarming is that it crashes the MDS servers on the Ceph cluster. The Ceph
cluster is all hardware nodes, and the OpenStack VM does not have an admin
keyring (although the CephX keyring generated for CephFS does have write
permissions to the ec42 pool).
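For context, the client key was created with pool-restricted caps, roughly like this (a sketch; the client name and keyring path are placeholders, not necessarily what I used):

```shell
# Sketch: create a CephFS client key restricted to the ec42 data pool
# (Luminous syntax; "client.cephfs" and the output path are placeholders).
ceph auth get-or-create client.cephfs \
    mon 'allow r' \
    mds 'allow rw' \
    osd 'allow rw pool=ec42' \
    -o /etc/ceph/ceph.client.cephfs.keyring
```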
+-------------------------------------------------------------+
|                                                             |
|                   Luminous CephFS Cluster                   |
|                       version 12.2.4                        |
|                        Ubuntu 16.04                         |
|           4.10.0-38-generic (all hardware nodes)            |
|                                                             |
+-------------------------------------------------------------+

+--------------------+      +-------------------+--------------------+--------------------+
|                    |      |                   |                    |                    |
|    Openstack VM    |      |  Ceph Monitor A   |   Ceph Monitor B   |   Ceph Monitor C   |
|    Ubuntu 16.04    +----->|  Ceph Mon Server  |     Ceph MDS A     | Ceph MDS Failover  |
| 4.13.0-39-generic  |      |      kh08-8       |       kh09-8       |       kh10-8       |
| Cephfs via kernel  |      |                   |                    |                    |
+--------------------+      +-------------------+--------------------+--------------------+

+-------------------------------------------------------------+
|                                                             |
|                     ec42       16384 PGs                    |
|                       CephFS Data Pool                      |
|               Erasure coded with 4/2 profile                |
|                                                             |
+-------------------------------------------------------------+

+-------------------------------------------------------------+
|                                                             |
|                cephfs_metadata    4096 PGs                  |
|                    CephFS Metadata Pool                     |
|                   Replicated pool (n=3)                     |
|                                                             |
+-------------------------------------------------------------+
As far as I am aware this shouldn't happen. I will try upgrading as soon as
I can, but I didn't see anything like this mentioned in the changelog and
am worried the bug will still exist in 12.2.5. Has anyone seen this before?
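To save anyone digging through the thread below, the reproducer boils down to this sequence on a client with a working CephFS kernel mount (paths are from my setup; please don't run this against a cluster you care about):

```shell
# WARNING: on my cluster this reliably takes down the active MDS.
# /cephfs is an existing CephFS kernel mount; /mnt/aufs and /aufs are scratch dirs.
mkdir -p /mnt/aufs /aufs
mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs

# Then, from a monitor node, watch the MDS state:
watch -n1 'ceph mds stat'
```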
On Mon, Apr 30, 2018 at 7:24 PM, Sean Sullivan <[email protected]> wrote:
> So I think I can reliably reproduce this crash from a ceph client.
>
> ```
> root@kh08-8:~# ceph -s
>   cluster:
>     id:     9f58ee5a-7c5d-4d68-81ee-debe16322544
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
>     mgr: kh08-8(active)
>     mds: cephfs-1/1/1 up {0=kh09-8=up:active}, 1 up:standby
>     osd: 570 osds: 570 up, 570 in
> ```
>
>
> Then, from a client, try to mount aufs over CephFS:
> ```
> mount -vvv -t aufs -o br=/cephfs=rw:/mnt/aufs=rw -o udba=reval none /aufs
> ```
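(For completeness: /cephfs above is a plain kernel-client mount, something along these lines; the monitor address and secret file here are placeholders:)

```shell
# Sketch of the kernel-client mount backing /cephfs
# (placeholder monitor address and secret file).
mkdir -p /cephfs
mount -t ceph kh08-8:6789:/ /cephfs \
    -o name=cephfs,secretfile=/etc/ceph/cephfs.secret
```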
>
> Now watch as your Ceph MDS servers fail:
>
> ```
> root@kh08-8:~# ceph -s
>   cluster:
>     id:     9f58ee5a-7c5d-4d68-81ee-debe16322544
>     health: HEALTH_WARN
>             insufficient standby MDS daemons available
>
>   services:
>     mon: 3 daemons, quorum kh08-8,kh09-8,kh10-8
>     mgr: kh08-8(active)
>     mds: cephfs-1/1/1 up {0=kh10-8=up:active(laggy or crashed)}
> ```
>
>
> I am now stuck in a degraded state and I can't seem to get the MDS daemons to start again.
>
> On Mon, Apr 30, 2018 at 5:06 PM, Sean Sullivan <[email protected]>
> wrote:
>
>> I had 2 MDS servers (one active, one standby) and both were down. I took a
>> dumb chance and marked the active one as down (it said it was up but laggy).
>> Then I started the primary again and now both are back up. I have never seen
>> this before, and I am also not sure what I just did.
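(For the archives: the "marked the active as down" step above amounts to something like the following; the hostnames are from my cluster and the exact target may differ for you:)

```shell
# Mark the laggy "active" MDS as failed so a restarted daemon can take over.
ceph mds fail kh10-8                 # or by rank: ceph mds fail 0
systemctl restart ceph-mds@kh08-8    # on the MDS host
ceph mds stat                        # confirm a rank comes back up:active
```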
>>
>> On Mon, Apr 30, 2018 at 4:32 PM, Sean Sullivan <[email protected]>
>> wrote:
>>
>>> I was creating a new user and mount point. On another hardware node I
>>> mounted CephFS as admin to mount as root. I created /aufstest and then
>>> unmounted it. From there it seems that both of my MDS nodes crashed for some
>>> reason, and I can't start them any more.
>>>
>>> https://pastebin.com/1ZgkL9fa -- my mds log
>>>
>>> I have never had this happen in my tests so now I have live data here.
>>> If anyone can lend a hand or point me in the right direction while
>>> troubleshooting that would be a godsend!
>>>
>>> I tried cephfs-journal-tool inspect and it reports that the journal
>>> should be fine. I am not sure why it's crashing:
>>>
>>> ```
>>> /home/lacadmin# cephfs-journal-tool journal inspect
>>> Overall journal integrity: OK
>>> ```
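(In case it helps anyone following along, these are the read-only cephfs-journal-tool subcommands I've been using to poke at the journal; run them from a node with the admin keyring:)

```shell
# Read-only journal checks (Luminous cephfs-journal-tool).
cephfs-journal-tool journal inspect    # overall integrity check
cephfs-journal-tool header get         # dump the journal header as JSON
cephfs-journal-tool event get summary  # count journal events by type
```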
>>>
>>>
>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com