Hi,
have you checked the output of "ceph-disk list" on the nodes where the OSDs are
not coming back up?
This should give you a hint on what's going on.
Also use dmesg to search for any error messages.
And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages produced
by the OSD itself when it starts.
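The steps above can be sketched as a few commands, assuming osd.1 is one of the failing OSDs (substitute your own id). The last two lines are an extra suggestion for the "start request repeated too quickly" error in your systemctl output: systemd's start-rate limit must be cleared before the unit will start again.

```shell
#!/bin/sh
# Adjust to the id of the OSD that stays "down" (osd.1 is just an example)
OSD_ID=1

# 1. List the disks/partitions as Ceph sees them
ceph-disk list

# 2. Look for kernel-level disk or filesystem errors
dmesg | grep -i -e error -e fail

# 3. Check the OSD's own log for the reason it aborted at startup
tail -n 100 /var/log/ceph/ceph-osd.${OSD_ID}.log

# 4. The unit hit systemd's start-rate limit; clear it before retrying
systemctl reset-failed ceph-osd@${OSD_ID}
systemctl start ceph-osd@${OSD_ID}
```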
Regards
JC
> On Oct 19, 2017, at 12:11, Josy <[email protected]> wrote:
>
> Hi,
>
> I am not able to start some of the OSDs in the cluster.
>
> This is a test cluster and had 8 OSDs. One node was taken out for
> maintenance. I set the noout flag and after the server came back up I unset
> the noout flag.
>
> Suddenly a couple of OSDs went down.
>
> And now I can start the OSDs manually from each node, but the status is still
> "down"
>
> $ ceph osd stat
> 8 osds: 2 up, 5 in
>
>
> $ ceph osd tree
> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> -1 7.97388 root default
> -3 1.86469 host a1-osd
> 1 ssd 1.86469 osd.1 down 0 1.00000
> -5 0.87320 host a2-osd
> 2 ssd 0.87320 osd.2 down 0 1.00000
> -7 0.87320 host a3-osd
> 4 ssd 0.87320 osd.4 down 1.00000 1.00000
> -9 0.87320 host a4-osd
> 8 ssd 0.87320 osd.8 up 1.00000 1.00000
> -11 0.87320 host a5-osd
> 12 ssd 0.87320 osd.12 down 1.00000 1.00000
> -13 0.87320 host a6-osd
> 17 ssd 0.87320 osd.17 up 1.00000 1.00000
> -15 0.87320 host a7-osd
> 21 ssd 0.87320 osd.21 down 1.00000 1.00000
> -17 0.87000 host a8-osd
> 28 ssd 0.87000 osd.28 down 0 1.00000
>
> I can also see this error on each OSD node.
>
> # systemctl status ceph-osd@1
> ● [email protected] - Ceph object storage daemon osd.1
> Loaded: loaded (/usr/lib/systemd/system/[email protected]; enabled; vendor
> preset: disabled)
> Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18 PDT;
> 19min ago
> Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i
> --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
> Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster
> ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
> Main PID: 4163 (code=killed, signal=ABRT)
>
> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit [email protected] entered
> failed state.
> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: [email protected] failed.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: [email protected] holdoff time
> over, scheduling restart.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too
> quickly for [email protected]
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object
> storage daemon osd.1.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit [email protected] entered
> failed state.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: [email protected] failed.
>
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com