On Fri, Oct 20, 2017 at 6:32 AM, Josy <[email protected]> wrote:
> Hi,
>
>>> have you checked the output of "ceph-disk list" on the nodes where the
>>> OSDs are not coming back up?
>
> Yes, it shows all the disks correctly mounted.
>
>>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>>> produced by the OSD itself when it starts.
>
> These are the error messages seen in one of the OSD log files. Even though
> the service starts, the OSD status still shows as down.
>
>
> =============================
>
> -7> 2017-10-19 13:16:15.589465 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown
> NOTIFY] enter Reset
> -6> 2017-10-19 13:16:15.589476 7efefcda4d00 5 write_log_and_missing
> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
> clear_divergent_priors: 0
> -5> 2017-10-19 13:16:15.591629 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.10(unlocked)] enter Initial
> -4> 2017-10-19 13:16:15.591759 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
> NOTIFY] exit Initial 0.000130 0 0.000000
> -3> 2017-10-19 13:16:15.591786 7efefcda4d00 5 osd.28 pg_epoch: 4312
> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
> NOTIFY] enter Reset
> -2> 2017-10-19 13:16:15.591799 7efefcda4d00 5 write_log_and_missing
> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
> clear_divergent_priors: 0
> -1> 2017-10-19 13:16:15.594757 7efefcda4d00 5 osd.28 pg_epoch: 4306
> pg[32.ds0(unlocked)] enter Initial
> 0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
> In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
> thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
> 38: FAILED assert(stripe_width % stripe_size == 0)
What does your erasure code profile look like for pool 32?
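
You can dump it with something like this (substitute whatever profile
pool 32 actually uses):

$ ceph osd pool ls detail | grep 'pool 32'
$ ceph osd erasure-code-profile ls
$ ceph osd erasure-code-profile get <profile-name>

The failed assert (stripe_width % stripe_size == 0) points at the pool's
EC stripe geometry, so the profile is the natural thing to check.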
>
>
>
> On 20-10-2017 01:05, Jean-Charles Lopez wrote:
>>
>> Hi,
>>
>> have you checked the output of "ceph-disk list" on the nodes where the
>> OSDs are not coming back up?
>>
>> This should give you a hint on what's going on.
>>
>> Also use dmesg to search for any error messages.
>>
>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>> produced by the OSD itself when it starts.
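>>
>> For example (a sketch; replace the id with that of a failing OSD):
>>
>> # ceph-disk list
>> # dmesg | grep -iE 'error|fail'
>> # less /var/log/ceph/ceph-osd.1.log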
>>
>> Regards
>> JC
>>
>>> On Oct 19, 2017, at 12:11, Josy <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I am not able to start some of the OSDs in the cluster.
>>>
>>> This is a test cluster with 8 OSDs. One node was taken out for
>>> maintenance. I set the noout flag and, after the server came back up,
>>> I unset it.
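>>>
>>> For reference, the flag was toggled with the usual commands:
>>>
>>> $ ceph osd set noout
>>> $ ceph osd unset noout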
>>>
>>> Suddenly a couple of OSDs went down.
>>>
>>> And now I can start the OSDs manually from each node, but the status
>>> still shows as "down".
>>>
>>> $ ceph osd stat
>>> 8 osds: 2 up, 5 in
>>>
>>>
>>> $ ceph osd tree
>>> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
>>> -1 7.97388 root default
>>> -3 1.86469 host a1-osd
>>> 1 ssd 1.86469 osd.1 down 0 1.00000
>>> -5 0.87320 host a2-osd
>>> 2 ssd 0.87320 osd.2 down 0 1.00000
>>> -7 0.87320 host a3-osd
>>> 4 ssd 0.87320 osd.4 down 1.00000 1.00000
>>> -9 0.87320 host a4-osd
>>> 8 ssd 0.87320 osd.8 up 1.00000 1.00000
>>> -11 0.87320 host a5-osd
>>> 12 ssd 0.87320 osd.12 down 1.00000 1.00000
>>> -13 0.87320 host a6-osd
>>> 17 ssd 0.87320 osd.17 up 1.00000 1.00000
>>> -15 0.87320 host a7-osd
>>> 21 ssd 0.87320 osd.21 down 1.00000 1.00000
>>> -17 0.87000 host a8-osd
>>> 28 ssd 0.87000 osd.28 down 0 1.00000
>>>
>>> I can also see this error on each OSD node.
>>>
>>> # systemctl status ceph-osd@1
>>> ● [email protected] - Ceph object storage daemon osd.1
>>> Loaded: loaded (/usr/lib/systemd/system/[email protected]; enabled;
>>> vendor preset: disabled)
>>> Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18
>>> PDT; 19min ago
>>> Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id
>>> %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
>>> Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh
>>> --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>>> Main PID: 4163 (code=killed, signal=ABRT)
>>>
>>> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit [email protected]
>>> entered failed state.
>>> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: [email protected] failed.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: [email protected] holdoff
>>> time over, scheduling restart.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too
>>> quickly for [email protected]
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object
>>> storage daemon osd.1.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit [email protected]
>>> entered failed state.
>>> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: [email protected] failed.
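>>>
>>> Since systemd stops retrying once the start limit is hit, I clear the
>>> failed state before each manual start attempt, e.g.:
>>>
>>> # systemctl reset-failed ceph-osd@1
>>> # systemctl start ceph-osd@1
>>> # journalctl -u ceph-osd@1 -e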
>>>
>>>
--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com