Re: [ceph-users] Help Ceph Cluster Down

2019-01-07 Thread Caspar Smit
Arun,

This is what I already suggested in my first reply.

Kind regards,
Caspar

On Sat, Jan 5, 2019 at 06:52 Arun POONIA <arun.poo...@nuagenetworks.net> wrote:

> Hi Kevin,
>
> You are right. Increasing number of PGs per OSD resolved the issue. I will
> probably add this config in /etc/ceph/ceph.conf file of ceph mon and OSDs
> so it applies on host boot.
>
> Thanks
> Arun
>
> On Fri, Jan 4, 2019 at 3:46 PM Kevin Olbrich  wrote:
>
>> Hi Arun,
>>
>> Actually, deleting was not a good idea; that's why I wrote that the OSDs
>> should be "out".
>> You have down PGs because the data is on OSDs that are
>> unavailable but still known by the cluster.
>> This can be checked by using "ceph pg 0.5 query" (change the PG name).
>>
>> Because your PG count is so heavily oversized, the overdose limits get
>> hit on every recovery in your cluster.
>> I had the same problem on a medium cluster when I added too many new
>> disks at once.
>> You already got this info from Caspar earlier in this thread.
>>
>>
>> https://ceph.com/planet/placement-groups-with-ceph-luminous-stay-in-activating-state/
>>
>> https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/
>>
>> The second link shows one of the config params you need to inject to
>> all your OSDs like this:
>> ceph tell osd.* injectargs --mon_max_pg_per_osd 1
>>
>> This might help you get these PGs into some sort of "active" state
>> (+recovery/+degraded/+inconsistent/etc.).
>>
>> The down PGs will most likely never come back. I would bet you will
>> find OSD IDs in the acting set that are invalid, meaning that
>> non-existent OSDs hold your data.
>> I had a similar problem on a test cluster with erasure code pools
>> where too many disks failed at the same time; in that case you will see
>> negative values as OSD IDs.
>>
>> Maybe this helps a little bit.
>>
>> Kevin
>>
>> On Sat, Jan 5, 2019 at 00:20 Arun POONIA wrote:
>> >
>> > Hi Kevin,
>> >
>> > I tried deleting newly added server from Ceph Cluster and looks like
>> Ceph is not recovering. I agree with unfound data but it doesn't say about
>> unfound data. It says inactive/down for PGs and I can't bring them up.
>> >
>> >
>> > [root@fre101 ~]# ceph health detail
>> > 2019-01-04 15:17:05.711641 7f27b0f31700 -1 asok(0x7f27ac0017a0)
>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
>> bind the UNIX domain socket to
>> '/var/run/ceph-guests/ceph-client.admin.129552.139808366139728.asok': (2)
>> No such file or directory
>> > HEALTH_ERR 3 pools have many more objects per pg than average;
>> 523656/12393978 objects misplaced (4.225%); 6517 PGs pending on creation;
>> Reduced data availability: 6585 pgs inactive, 1267 pgs down, 2 pgs peering,
>> 2703 pgs stale; Degraded data redundancy: 86858/12393978 objects degraded
>> (0.701%), 717 pgs degraded, 21 pgs undersized; 99059 slow requests are
>> blocked > 32 sec; 4834 stuck requests are blocked > 4096 sec; too many PGs
>> per OSD (3003 > max 200)
>> > MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
>> > pool glance-images objects per pg (10478) is more than 92.7257
>> times cluster average (113)
>> > pool vms objects per pg (4722) is more than 41.7876 times cluster
>> average (113)
>> > pool volumes objects per pg (1220) is more than 10.7965 times
>> cluster average (113)
>> > OBJECT_MISPLACED 523656/12393978 objects misplaced (4.225%)
>> > PENDING_CREATING_PGS 6517 PGs pending on creation
>> > osds
>> [osd.0,osd.1,osd.10,osd.11,osd.12,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19,osd.2,osd.20,osd.21,osd.22,osd.23,osd.24,osd.25,osd.26,osd.27,osd.28,osd.29,osd.3,osd.30,osd.31,osd.32,osd.33,osd.34,osd.35,osd.4,osd.5,osd.6,osd.7,osd.8,osd.9]
>> have pending PGs.
>> > PG_AVAILABILITY Reduced data availability: 6585 pgs inactive, 1267 pgs
>> down, 2 pgs peering, 2703 pgs stale
>> > pg 10.90e is stuck inactive for 94928.999109, current state
>> activating, last acting [2,6]
>> > pg 10.913 is stuck inactive for 95094.175400, current state
>> activating, last acting [9,5]
>> > pg 10.915 is stuck inactive for 94929.184177, current state
>> activating, last acting [30,26]
>> > pg 11.907 is stuck stale for 9612.906582, current state
>> stale+active+clean, last acting [38,24]
>> > pg 11.910 is stuck stale for 11822.359237, current state
>> stale+down, last acting [21]
>> > pg 11.915 is stuck stale for 9612.906604, current state
>> stale+active+clean, last acting [38,31]
>> > pg 11.919 is stuck inactive for 95636.716568, current state
>> activating, last acting [25,12]
>> > pg 12.902 is stuck stale for 10810.497213, current state
>> stale+activating, last acting [36,14]
>> > pg 13.901 is stuck stale for 94889.512234, current state
>> stale+active+clean, last acting [1,31]
>> > pg 13.904 is stuck stale for 10745.279158, current state
>> stale+active+clean, last acting [37,8]
>> > pg 13.908 is stuck stale for 10745.279176, current state
>> 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Kevin,

You are right. Increasing the allowed number of PGs per OSD resolved the issue.
I will probably add this config to the /etc/ceph/ceph.conf file on the Ceph MONs
and OSDs so it applies on host boot.
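
A minimal sketch of what I plan to put in the [global] section (using the
values Caspar suggested earlier in this thread; adjust if your cluster needs
different limits):

[global]
mon max pg per osd = 1000
osd max pg per osd hard ratio = 5

The daemons should then pick these values up on their next restart/boot.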

Thanks
Arun

On Fri, Jan 4, 2019 at 3:46 PM Kevin Olbrich  wrote:

> Hi Arun,
>
> Actually, deleting was not a good idea; that's why I wrote that the OSDs
> should be "out".
> You have down PGs because the data is on OSDs that are
> unavailable but still known by the cluster.
> This can be checked by using "ceph pg 0.5 query" (change the PG name).
>
> Because your PG count is so heavily oversized, the overdose limits get
> hit on every recovery in your cluster.
> I had the same problem on a medium cluster when I added too many new
> disks at once.
> You already got this info from Caspar earlier in this thread.
>
>
> https://ceph.com/planet/placement-groups-with-ceph-luminous-stay-in-activating-state/
>
> https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/
>
> The second link shows one of the config params you need to inject to
> all your OSDs like this:
> ceph tell osd.* injectargs --mon_max_pg_per_osd 1
>
> This might help you get these PGs into some sort of "active" state
> (+recovery/+degraded/+inconsistent/etc.).
>
> The down PGs will most likely never come back. I would bet you will
> find OSD IDs in the acting set that are invalid, meaning that
> non-existent OSDs hold your data.
> I had a similar problem on a test cluster with erasure code pools
> where too many disks failed at the same time; in that case you will see
> negative values as OSD IDs.
>
> Maybe this helps a little bit.
>
> Kevin
>
> On Sat, Jan 5, 2019 at 00:20 Arun POONIA wrote:
> >
> > Hi Kevin,
> >
> > I tried deleting newly added server from Ceph Cluster and looks like
> Ceph is not recovering. I agree with unfound data but it doesn't say about
> unfound data. It says inactive/down for PGs and I can't bring them up.
> >
> >
> > [root@fre101 ~]# ceph health detail
> > 2019-01-04 15:17:05.711641 7f27b0f31700 -1 asok(0x7f27ac0017a0)
> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
> bind the UNIX domain socket to
> '/var/run/ceph-guests/ceph-client.admin.129552.139808366139728.asok': (2)
> No such file or directory
> > HEALTH_ERR 3 pools have many more objects per pg than average;
> 523656/12393978 objects misplaced (4.225%); 6517 PGs pending on creation;
> Reduced data availability: 6585 pgs inactive, 1267 pgs down, 2 pgs peering,
> 2703 pgs stale; Degraded data redundancy: 86858/12393978 objects degraded
> (0.701%), 717 pgs degraded, 21 pgs undersized; 99059 slow requests are
> blocked > 32 sec; 4834 stuck requests are blocked > 4096 sec; too many PGs
> per OSD (3003 > max 200)
> > MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
> > pool glance-images objects per pg (10478) is more than 92.7257 times
> cluster average (113)
> > pool vms objects per pg (4722) is more than 41.7876 times cluster
> average (113)
> > pool volumes objects per pg (1220) is more than 10.7965 times
> cluster average (113)
> > OBJECT_MISPLACED 523656/12393978 objects misplaced (4.225%)
> > PENDING_CREATING_PGS 6517 PGs pending on creation
> > osds
> [osd.0,osd.1,osd.10,osd.11,osd.12,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19,osd.2,osd.20,osd.21,osd.22,osd.23,osd.24,osd.25,osd.26,osd.27,osd.28,osd.29,osd.3,osd.30,osd.31,osd.32,osd.33,osd.34,osd.35,osd.4,osd.5,osd.6,osd.7,osd.8,osd.9]
> have pending PGs.
> > PG_AVAILABILITY Reduced data availability: 6585 pgs inactive, 1267 pgs
> down, 2 pgs peering, 2703 pgs stale
> > pg 10.90e is stuck inactive for 94928.999109, current state
> activating, last acting [2,6]
> > pg 10.913 is stuck inactive for 95094.175400, current state
> activating, last acting [9,5]
> > pg 10.915 is stuck inactive for 94929.184177, current state
> activating, last acting [30,26]
> > pg 11.907 is stuck stale for 9612.906582, current state
> stale+active+clean, last acting [38,24]
> > pg 11.910 is stuck stale for 11822.359237, current state stale+down,
> last acting [21]
> > pg 11.915 is stuck stale for 9612.906604, current state
> stale+active+clean, last acting [38,31]
> > pg 11.919 is stuck inactive for 95636.716568, current state
> activating, last acting [25,12]
> > pg 12.902 is stuck stale for 10810.497213, current state
> stale+activating, last acting [36,14]
> > pg 13.901 is stuck stale for 94889.512234, current state
> stale+active+clean, last acting [1,31]
> > pg 13.904 is stuck stale for 10745.279158, current state
> stale+active+clean, last acting [37,8]
> > pg 13.908 is stuck stale for 10745.279176, current state
> stale+active+clean, last acting [37,19]
> > pg 13.909 is stuck inactive for 95370.129659, current state
> activating, last acting [34,19]
> > pg 13.90e is stuck inactive for 95370.379694, current state
> activating, last acting [21,20]
> > pg 13.911 is stuck inactive for 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Kevin Olbrich
Hi Arun,

Actually, deleting was not a good idea; that's why I wrote that the OSDs
should be "out".
You have down PGs because the data is on OSDs that are
unavailable but still known by the cluster.
This can be checked by using "ceph pg 0.5 query" (change the PG name).

Because your PG count is so heavily oversized, the overdose limits get
hit on every recovery in your cluster.
I had the same problem on a medium cluster when I added too many new
disks at once.
You already got this info from Caspar earlier in this thread.

https://ceph.com/planet/placement-groups-with-ceph-luminous-stay-in-activating-state/
https://blog.widodh.nl/2018/01/placement-groups-with-ceph-luminous-stay-in-activating-state/

The second link shows one of the config params you need to inject to
all your OSDs like this:
ceph tell osd.* injectargs --mon_max_pg_per_osd 1

This might help you get these PGs into some sort of "active" state
(+recovery/+degraded/+inconsistent/etc.).
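
For example, roughly like this (a minimal sketch; 10.90e is just one of the
stuck PGs from your health output, and osd.0 is assumed to be local to the
host where you run the daemon command):

ceph pg dump_stuck inactive | head                # list the stuck/inactive PGs
ceph pg 10.90e query | less                       # inspect one PG, check "up" and "acting"
ceph daemon osd.0 config get mon_max_pg_per_osd   # verify the injected value took effect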

The down PGs will most likely never come back. I would bet you will
find OSD IDs in the acting set that are invalid, meaning that
non-existent OSDs hold your data.
I had a similar problem on a test cluster with erasure code pools
where too many disks failed at the same time; in that case you will see
negative values as OSD IDs.

Maybe this helps a little bit.

Kevin

On Sat, Jan 5, 2019 at 00:20 Arun POONIA wrote:
>
> Hi Kevin,
>
> I tried deleting newly added server from Ceph Cluster and looks like Ceph is 
> not recovering. I agree with unfound data but it doesn't say about unfound 
> data. It says inactive/down for PGs and I can't bring them up.
>
>
> [root@fre101 ~]# ceph health detail
> 2019-01-04 15:17:05.711641 7f27b0f31700 -1 asok(0x7f27ac0017a0) 
> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
> bind the UNIX domain socket to 
> '/var/run/ceph-guests/ceph-client.admin.129552.139808366139728.asok': (2) No 
> such file or directory
> HEALTH_ERR 3 pools have many more objects per pg than average; 
> 523656/12393978 objects misplaced (4.225%); 6517 PGs pending on creation; 
> Reduced data availability: 6585 pgs inactive, 1267 pgs down, 2 pgs peering, 
> 2703 pgs stale; Degraded data redundancy: 86858/12393978 objects degraded 
> (0.701%), 717 pgs degraded, 21 pgs undersized; 99059 slow requests are 
> blocked > 32 sec; 4834 stuck requests are blocked > 4096 sec; too many PGs 
> per OSD (3003 > max 200)
> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
> pool glance-images objects per pg (10478) is more than 92.7257 times 
> cluster average (113)
> pool vms objects per pg (4722) is more than 41.7876 times cluster average 
> (113)
> pool volumes objects per pg (1220) is more than 10.7965 times cluster 
> average (113)
> OBJECT_MISPLACED 523656/12393978 objects misplaced (4.225%)
> PENDING_CREATING_PGS 6517 PGs pending on creation
> osds 
> [osd.0,osd.1,osd.10,osd.11,osd.12,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19,osd.2,osd.20,osd.21,osd.22,osd.23,osd.24,osd.25,osd.26,osd.27,osd.28,osd.29,osd.3,osd.30,osd.31,osd.32,osd.33,osd.34,osd.35,osd.4,osd.5,osd.6,osd.7,osd.8,osd.9]
>  have pending PGs.
> PG_AVAILABILITY Reduced data availability: 6585 pgs inactive, 1267 pgs down, 
> 2 pgs peering, 2703 pgs stale
> pg 10.90e is stuck inactive for 94928.999109, current state activating, 
> last acting [2,6]
> pg 10.913 is stuck inactive for 95094.175400, current state activating, 
> last acting [9,5]
> pg 10.915 is stuck inactive for 94929.184177, current state activating, 
> last acting [30,26]
> pg 11.907 is stuck stale for 9612.906582, current state 
> stale+active+clean, last acting [38,24]
> pg 11.910 is stuck stale for 11822.359237, current state stale+down, last 
> acting [21]
> pg 11.915 is stuck stale for 9612.906604, current state 
> stale+active+clean, last acting [38,31]
> pg 11.919 is stuck inactive for 95636.716568, current state activating, 
> last acting [25,12]
> pg 12.902 is stuck stale for 10810.497213, current state 
> stale+activating, last acting [36,14]
> pg 13.901 is stuck stale for 94889.512234, current state 
> stale+active+clean, last acting [1,31]
> pg 13.904 is stuck stale for 10745.279158, current state 
> stale+active+clean, last acting [37,8]
> pg 13.908 is stuck stale for 10745.279176, current state 
> stale+active+clean, last acting [37,19]
> pg 13.909 is stuck inactive for 95370.129659, current state activating, 
> last acting [34,19]
> pg 13.90e is stuck inactive for 95370.379694, current state activating, 
> last acting [21,20]
> pg 13.911 is stuck inactive for 98449.317873, current state activating, 
> last acting [25,22]
> pg 13.914 is stuck stale for 11827.503651, current state stale+down, last 
> acting [29]
> pg 13.917 is stuck inactive for 94564.811121, current state activating, 
> last acting [16,12]
> pg 14.901 is stuck inactive for 94929.006707, current state 
> activating+degraded, last acting 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Kevin,

I tried deleting the newly added server from the Ceph cluster and it looks
like Ceph is not recovering. I agree about unfound data, but the health output
doesn't report any unfound data. It reports PGs as inactive/down and I can't
bring them up.


[root@fre101 ~]# ceph health detail
2019-01-04 15:17:05.711641 7f27b0f31700 -1 asok(0x7f27ac0017a0)
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
bind the UNIX domain socket to
'/var/run/ceph-guests/ceph-client.admin.129552.139808366139728.asok': (2)
No such file or directory
HEALTH_ERR 3 pools have many more objects per pg than average;
523656/12393978 objects misplaced (4.225%); 6517 PGs pending on creation;
Reduced data availability: 6585 pgs inactive, 1267 pgs down, 2 pgs peering,
2703 pgs stale; Degraded data redundancy: 86858/12393978 objects degraded
(0.701%), 717 pgs degraded, 21 pgs undersized; 99059 slow requests are
blocked > 32 sec; 4834 stuck requests are blocked > 4096 sec; too many PGs
per OSD (3003 > max 200)
MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
pool glance-images objects per pg (10478) is more than 92.7257 times
cluster average (113)
pool vms objects per pg (4722) is more than 41.7876 times cluster
average (113)
pool volumes objects per pg (1220) is more than 10.7965 times cluster
average (113)
OBJECT_MISPLACED 523656/12393978 objects misplaced (4.225%)
PENDING_CREATING_PGS 6517 PGs pending on creation
osds
[osd.0,osd.1,osd.10,osd.11,osd.12,osd.13,osd.14,osd.15,osd.16,osd.17,osd.18,osd.19,osd.2,osd.20,osd.21,osd.22,osd.23,osd.24,osd.25,osd.26,osd.27,osd.28,osd.29,osd.3,osd.30,osd.31,osd.32,osd.33,osd.34,osd.35,osd.4,osd.5,osd.6,osd.7,osd.8,osd.9]
have pending PGs.
PG_AVAILABILITY Reduced data availability: 6585 pgs inactive, 1267 pgs
down, 2 pgs peering, 2703 pgs stale
pg 10.90e is stuck inactive for 94928.999109, current state activating,
last acting [2,6]
pg 10.913 is stuck inactive for 95094.175400, current state activating,
last acting [9,5]
pg 10.915 is stuck inactive for 94929.184177, current state activating,
last acting [30,26]
pg 11.907 is stuck stale for 9612.906582, current state
stale+active+clean, last acting [38,24]
pg 11.910 is stuck stale for 11822.359237, current state stale+down,
last acting [21]
pg 11.915 is stuck stale for 9612.906604, current state
stale+active+clean, last acting [38,31]
pg 11.919 is stuck inactive for 95636.716568, current state activating,
last acting [25,12]
pg 12.902 is stuck stale for 10810.497213, current state
stale+activating, last acting [36,14]
pg 13.901 is stuck stale for 94889.512234, current state
stale+active+clean, last acting [1,31]
pg 13.904 is stuck stale for 10745.279158, current state
stale+active+clean, last acting [37,8]
pg 13.908 is stuck stale for 10745.279176, current state
stale+active+clean, last acting [37,19]
pg 13.909 is stuck inactive for 95370.129659, current state activating,
last acting [34,19]
pg 13.90e is stuck inactive for 95370.379694, current state activating,
last acting [21,20]
pg 13.911 is stuck inactive for 98449.317873, current state activating,
last acting [25,22]
pg 13.914 is stuck stale for 11827.503651, current state stale+down,
last acting [29]
pg 13.917 is stuck inactive for 94564.811121, current state activating,
last acting [16,12]
pg 14.901 is stuck inactive for 94929.006707, current state
activating+degraded, last acting [22,8]
pg 14.910 is stuck inactive for 94929.046256, current state
activating+degraded, last acting [17,2]
pg 14.912 is stuck inactive for 10831.758524, current state activating,
last acting [18,2]
pg 14.915 is stuck inactive for 94929.001390, current state activating,
last acting [34,23]
pg 15.90c is stuck inactive for 93957.371333, current state activating,
last acting [29,10]
pg 15.90d is stuck inactive for 94929.145438, current state activating,
last acting [5,31]
pg 15.913 is stuck stale for 10745.279197, current state
stale+active+clean, last acting [37,12]
pg 15.915 is stuck stale for 12343.606595, current state stale+down,
last acting [0]
pg 15.91c is stuck stale for 10650.058945, current state stale+down,
last acting [12]
pg 16.90e is stuck inactive for 94929.240626, current state activating,
last acting [14,2]
pg 16.919 is stuck inactive for 94564.771129, current state activating,
last acting [20,4]
pg 16.91e is stuck inactive for 94960.007104, current state activating,
last acting [22,12]
pg 17.908 is stuck inactive for 12250.346380, current state activating,
last acting [27,18]
pg 17.90b is stuck inactive for 11714.951268, current state activating,
last acting [12,25]
pg 17.910 is stuck inactive for 94564.819149, current state activating,
last acting [26,16]
pg 17.913 is stuck inactive for 95370.177309, current state activating,
last acting [13,31]
pg 17.91f is stuck inactive for 95147.032346, current state activating,
last acting [6,18]
 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Kevin Olbrich
I don't think this will help you. Unfound means the cluster is unable
to find the data anywhere (it's lost).
It would be sufficient to shut down the new host - the OSDs will then be out.
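
For example (a rough sketch - assuming the new host is fre201 and its OSDs are
osd.36, osd.37 and osd.38, which your earlier output suggests; double-check
with "ceph osd tree" first):

systemctl stop ceph-osd@36 ceph-osd@37 ceph-osd@38   # run this on the new host
ceph osd out 36 37 38                                # mark them out right away instead of waiting for the timeout
ceph -s                                              # then watch whether recovery makes progress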

You can also force-heal the cluster, something like "do your best possible":

ceph pg 2.5 mark_unfound_lost revert|delete

Src: http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/
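
If you want to see what is actually unfound before reverting or deleting,
something along these lines (a sketch; 2.5 is just a placeholder PG id):

ceph health detail | grep unfound      # which PGs report unfound objects, if any
ceph pg 2.5 list_unfound               # list the unfound objects in that PG
ceph pg 2.5 mark_unfound_lost revert   # then revert (or delete) them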

Kevin

On Fri, Jan 4, 2019 at 20:47 Arun POONIA wrote:
>
> Hi Kevin,
>
> Can I remove newly added server from Cluster and see if it heals cluster ?
>
> When I check Hard Disk Iops on new server which are very low compared to 
> existing cluster server.
>
> Indeed this is a critical cluster but I don't have expertise to make it 
> flawless.
>
> Thanks
> Arun
>
> On Fri, Jan 4, 2019 at 11:35 AM Kevin Olbrich  wrote:
>>
>> If you really created and destroyed OSDs before the cluster healed
>> itself, this data will be permanently lost (not found / inactive).
>> Also, your PG count is so heavily oversized that the calculation for peering
>> will most likely break, because this was never tested.
>>
>> If this is a critical cluster, I would start a new one and bring back
>> the backups (using a better PG count).
>>
>> Kevin
>>
>> On Fri, Jan 4, 2019 at 20:25 Arun POONIA wrote:
>> >
>> > Can anyone comment on this issue please, I can't seem to bring my cluster 
>> > healthy.
>> >
>> > On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA  
>> > wrote:
>> >>
>> >> Hi Caspar,
>> >>
>> >> Number of IOPs are also quite low. It used be around 1K Plus on one of 
>> >> Pool (VMs) now its like close to 10-30 .
>> >>
>> >> Thanks
>> >> Arun
>> >>
>> >> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA 
>> >>  wrote:
>> >>>
>> >>> Hi Caspar,
>> >>>
>> >>> Yes and No, numbers are going up and down. If I run ceph -s command I 
>> >>> can see it decreases one time and later it increases again. I see there 
>> >>> are so many blocked/slow requests. Almost all the OSDs have slow 
>> >>> requests. Around 12% PGs are inactive not sure how to activate them 
>> >>> again.
>> >>>
>> >>>
>> >>> [root@fre101 ~]# ceph health detail
>> >>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0) 
>> >>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed 
>> >>> to bind the UNIX domain socket to 
>> >>> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': 
>> >>> (2) No such file or directory
>> >>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than 
>> >>> average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on 
>> >>> creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down, 
>> >>> 86 pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 
>> >>> objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 
>> >>> slow requests are blocked > 32 sec; 551 stuck requests are blocked > 
>> >>> 4096 sec; too many PGs per OSD (2709 > max 200)
>> >>> OSD_DOWN 1 osds down
>> >>> osd.28 (root=default,host=fre119) is down
>> >>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
>> >>> pool glance-images objects per pg (10478) is more than 92.7257 times 
>> >>> cluster average (113)
>> >>> pool vms objects per pg (4717) is more than 41.7434 times cluster 
>> >>> average (113)
>> >>> pool volumes objects per pg (1220) is more than 10.7965 times 
>> >>> cluster average (113)
>> >>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%)
>> >>> PENDING_CREATING_PGS 3610 PGs pending on creation
>> >>> osds 
>> >>> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9]
>> >>>  have pending PGs.
>> >>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs 
>> >>> down, 86 pgs peering, 850 pgs stale
>> >>> pg 10.900 is down, acting [18]
>> >>> pg 10.90e is stuck inactive for 60266.030164, current state 
>> >>> activating, last acting [2,38]
>> >>> pg 10.913 is stuck stale for 1887.552862, current state stale+down, 
>> >>> last acting [9]
>> >>> pg 10.915 is stuck inactive for 60266.215231, current state 
>> >>> activating, last acting [30,38]
>> >>> pg 11.903 is stuck inactive for 59294.465961, current state 
>> >>> activating, last acting [11,38]
>> >>> pg 11.910 is down, acting [21]
>> >>> pg 11.919 is down, acting [25]
>> >>> pg 12.902 is stuck inactive for 57118.544590, current state 
>> >>> activating, last acting [36,14]
>> >>> pg 13.8f8 is stuck inactive for 60707.167787, current state 
>> >>> activating, last acting [29,37]
>> >>> pg 13.901 is stuck stale for 60226.543289, current state 
>> >>> stale+active+clean, last acting [1,31]
>> >>> pg 13.905 is stuck inactive for 60266.050940, current state 
>> >>> activating, last acting [2,36]
>> >>> pg 13.909 is stuck inactive for 60707.160714, current 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Kevin,

Can I remove the newly added server from the cluster and see if it heals?

When I check the hard disk IOPS on the new server, they are very low compared
to the existing cluster servers.

Indeed, this is a critical cluster, but I don't have the expertise to make it
flawless.

Thanks
Arun

On Fri, Jan 4, 2019 at 11:35 AM Kevin Olbrich  wrote:

> If you really created and destroyed OSDs before the cluster healed
> itself, this data will be permanently lost (not found / inactive).
> Also, your PG count is so heavily oversized that the calculation for peering
> will most likely break, because this was never tested.
>
> If this is a critical cluster, I would start a new one and bring back
> the backups (using a better PG count).
>
> Kevin
>
> On Fri, Jan 4, 2019 at 20:25 Arun POONIA wrote:
> >
> > Can anyone comment on this issue please, I can't seem to bring my
> cluster healthy.
> >
> > On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA <
> arun.poo...@nuagenetworks.net> wrote:
> >>
> >> Hi Caspar,
> >>
> >> Number of IOPs are also quite low. It used be around 1K Plus on one of
> Pool (VMs) now its like close to 10-30 .
> >>
> >> Thanks
> >> Arun
> >>
> >> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA <
> arun.poo...@nuagenetworks.net> wrote:
> >>>
> >>> Hi Caspar,
> >>>
> >>> Yes and No, numbers are going up and down. If I run ceph -s command I
> can see it decreases one time and later it increases again. I see there are
> so many blocked/slow requests. Almost all the OSDs have slow requests.
> Around 12% PGs are inactive not sure how to activate them again.
> >>>
> >>>
> >>> [root@fre101 ~]# ceph health detail
> >>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0)
> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
> bind the UNIX domain socket to
> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2)
> No such file or directory
> >>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than
> average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on
> creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down, 86
> pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654
> objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow
> requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec;
> too many PGs per OSD (2709 > max 200)
> >>> OSD_DOWN 1 osds down
> >>> osd.28 (root=default,host=fre119) is down
> >>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
> >>> pool glance-images objects per pg (10478) is more than 92.7257
> times cluster average (113)
> >>> pool vms objects per pg (4717) is more than 41.7434 times cluster
> average (113)
> >>> pool volumes objects per pg (1220) is more than 10.7965 times
> cluster average (113)
> >>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%)
> >>> PENDING_CREATING_PGS 3610 PGs pending on creation
> >>> osds
> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9]
> have pending PGs.
> >>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs
> down, 86 pgs peering, 850 pgs stale
> >>> pg 10.900 is down, acting [18]
> >>> pg 10.90e is stuck inactive for 60266.030164, current state
> activating, last acting [2,38]
> >>> pg 10.913 is stuck stale for 1887.552862, current state
> stale+down, last acting [9]
> >>> pg 10.915 is stuck inactive for 60266.215231, current state
> activating, last acting [30,38]
> >>> pg 11.903 is stuck inactive for 59294.465961, current state
> activating, last acting [11,38]
> >>> pg 11.910 is down, acting [21]
> >>> pg 11.919 is down, acting [25]
> >>> pg 12.902 is stuck inactive for 57118.544590, current state
> activating, last acting [36,14]
> >>> pg 13.8f8 is stuck inactive for 60707.167787, current state
> activating, last acting [29,37]
> >>> pg 13.901 is stuck stale for 60226.543289, current state
> stale+active+clean, last acting [1,31]
> >>> pg 13.905 is stuck inactive for 60266.050940, current state
> activating, last acting [2,36]
> >>> pg 13.909 is stuck inactive for 60707.160714, current state
> activating, last acting [34,36]
> >>> pg 13.90e is stuck inactive for 60707.410749, current state
> activating, last acting [21,36]
> >>> pg 13.911 is down, acting [25]
> >>> pg 13.914 is stale+down, acting [29]
> >>> pg 13.917 is stuck stale for 580.224688, current state stale+down,
> last acting [16]
> >>> pg 14.901 is stuck inactive for 60266.037762, current state
> activating+degraded, last acting [22,37]
> >>> pg 14.90f is stuck inactive for 60296.996447, current state
> activating, last acting [30,36]
> >>> pg 14.910 is stuck inactive for 60266.077310, current state
> activating+degraded, last acting [17,37]
> >>>

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Kevin Olbrich
If you really created and destroyed OSDs before the cluster healed
itself, this data will be permanently lost (not found / inactive).
Also, your PG count is so heavily oversized that the calculation for peering
will most likely break, because this was never tested.

If this is a critical cluster, I would start a new one and bring back
the backups (using a better PG count).
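
As a rough rule of thumb (just a sketch, not gospel): total PGs across all
pools ~= (number of OSDs * 100) / replica size, rounded to a power of two.
With your 39 OSDs and, assuming size 2 (your acting sets only list two OSDs),
that gives 39 * 100 / 2 ~= 1950, so something in the region of 2048 PGs in
total spread over the pools - a far cry from the ~54k you have now.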

Kevin

On Fri, Jan 4, 2019 at 20:25 Arun POONIA wrote:
>
> Can anyone comment on this issue please, I can't seem to bring my cluster 
> healthy.
>
> On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA  
> wrote:
>>
>> Hi Caspar,
>>
>> Number of IOPs are also quite low. It used be around 1K Plus on one of Pool 
>> (VMs) now its like close to 10-30 .
>>
>> Thanks
>> Arun
>>
>> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA  
>> wrote:
>>>
>>> Hi Caspar,
>>>
>>> Yes and No, numbers are going up and down. If I run ceph -s command I can 
>>> see it decreases one time and later it increases again. I see there are so 
>>> many blocked/slow requests. Almost all the OSDs have slow requests. Around 
>>> 12% PGs are inactive not sure how to activate them again.
>>>
>>>
>>> [root@fre101 ~]# ceph health detail
>>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0) 
>>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
>>> bind the UNIX domain socket to 
>>> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2) 
>>> No such file or directory
>>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than average; 
>>> 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on creation; 
>>> Reduced data availability: 6578 pgs inactive, 1882 pgs down, 86 pgs 
>>> peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 objects 
>>> degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow 
>>> requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec; 
>>> too many PGs per OSD (2709 > max 200)
>>> OSD_DOWN 1 osds down
>>> osd.28 (root=default,host=fre119) is down
>>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
>>> pool glance-images objects per pg (10478) is more than 92.7257 times 
>>> cluster average (113)
>>> pool vms objects per pg (4717) is more than 41.7434 times cluster 
>>> average (113)
>>> pool volumes objects per pg (1220) is more than 10.7965 times cluster 
>>> average (113)
>>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%)
>>> PENDING_CREATING_PGS 3610 PGs pending on creation
>>> osds 
>>> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9]
>>>  have pending PGs.
>>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs 
>>> down, 86 pgs peering, 850 pgs stale
>>> pg 10.900 is down, acting [18]
>>> pg 10.90e is stuck inactive for 60266.030164, current state activating, 
>>> last acting [2,38]
>>> pg 10.913 is stuck stale for 1887.552862, current state stale+down, 
>>> last acting [9]
>>> pg 10.915 is stuck inactive for 60266.215231, current state activating, 
>>> last acting [30,38]
>>> pg 11.903 is stuck inactive for 59294.465961, current state activating, 
>>> last acting [11,38]
>>> pg 11.910 is down, acting [21]
>>> pg 11.919 is down, acting [25]
>>> pg 12.902 is stuck inactive for 57118.544590, current state activating, 
>>> last acting [36,14]
>>> pg 13.8f8 is stuck inactive for 60707.167787, current state activating, 
>>> last acting [29,37]
>>> pg 13.901 is stuck stale for 60226.543289, current state 
>>> stale+active+clean, last acting [1,31]
>>> pg 13.905 is stuck inactive for 60266.050940, current state activating, 
>>> last acting [2,36]
>>> pg 13.909 is stuck inactive for 60707.160714, current state activating, 
>>> last acting [34,36]
>>> pg 13.90e is stuck inactive for 60707.410749, current state activating, 
>>> last acting [21,36]
>>> pg 13.911 is down, acting [25]
>>> pg 13.914 is stale+down, acting [29]
>>> pg 13.917 is stuck stale for 580.224688, current state stale+down, last 
>>> acting [16]
>>> pg 14.901 is stuck inactive for 60266.037762, current state 
>>> activating+degraded, last acting [22,37]
>>> pg 14.90f is stuck inactive for 60296.996447, current state activating, 
>>> last acting [30,36]
>>> pg 14.910 is stuck inactive for 60266.077310, current state 
>>> activating+degraded, last acting [17,37]
>>> pg 14.915 is stuck inactive for 60266.032445, current state activating, 
>>> last acting [34,36]
>>> pg 15.8fa is stuck stale for 560.223249, current state stale+down, last 
>>> acting [8]
>>> pg 15.90c is stuck inactive for 59294.402388, current state activating, 
>>> last acting [29,38]
>>> pg 15.90d is stuck inactive for 60266.176492, current state activating, 
>>> last acting [5,36]
>>> pg 15.915 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Can anyone comment on this issue, please? I can't seem to bring my cluster
back to a healthy state.

On Fri, Jan 4, 2019 at 6:26 AM Arun POONIA 
wrote:

> Hi Caspar,
>
> Number of IOPs are also quite low. It used be around 1K Plus on one of
> Pool (VMs) now its like close to 10-30 .
>
> Thanks
> Arun
>
> On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA 
> wrote:
>
>> Hi Caspar,
>>
>> Yes and No, numbers are going up and down. If I run ceph -s command I can
>> see it decreases one time and later it increases again. I see there are so
>> many blocked/slow requests. Almost all the OSDs have slow requests. Around
>> 12% PGs are inactive not sure how to activate them again.
>>
>>
>> [root@fre101 ~]# ceph health detail
>> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0)
>> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
>> bind the UNIX domain socket to
>> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2)
>> No such file or directory
>> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than
>> average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on
>> creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down, 86
>> pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654
>> objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow
>> requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec;
>> too many PGs per OSD (2709 > max 200)
>> OSD_DOWN 1 osds down
>> osd.28 (root=default,host=fre119) is down
>> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
>> pool glance-images objects per pg (10478) is more than 92.7257 times
>> cluster average (113)
>> pool vms objects per pg (4717) is more than 41.7434 times cluster
>> average (113)
>> pool volumes objects per pg (1220) is more than 10.7965 times cluster
>> average (113)
>> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%)
>> PENDING_CREATING_PGS 3610 PGs pending on creation
>> osds
>> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9]
>> have pending PGs.
>> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs
>> down, 86 pgs peering, 850 pgs stale
>> pg 10.900 is down, acting [18]
>> pg 10.90e is stuck inactive for 60266.030164, current state
>> activating, last acting [2,38]
>> pg 10.913 is stuck stale for 1887.552862, current state stale+down,
>> last acting [9]
>> pg 10.915 is stuck inactive for 60266.215231, current state
>> activating, last acting [30,38]
>> pg 11.903 is stuck inactive for 59294.465961, current state
>> activating, last acting [11,38]
>> pg 11.910 is down, acting [21]
>> pg 11.919 is down, acting [25]
>> pg 12.902 is stuck inactive for 57118.544590, current state
>> activating, last acting [36,14]
>> pg 13.8f8 is stuck inactive for 60707.167787, current state
>> activating, last acting [29,37]
>> pg 13.901 is stuck stale for 60226.543289, current state
>> stale+active+clean, last acting [1,31]
>> pg 13.905 is stuck inactive for 60266.050940, current state
>> activating, last acting [2,36]
>> pg 13.909 is stuck inactive for 60707.160714, current state
>> activating, last acting [34,36]
>> pg 13.90e is stuck inactive for 60707.410749, current state
>> activating, last acting [21,36]
>> pg 13.911 is down, acting [25]
>> pg 13.914 is stale+down, acting [29]
>> pg 13.917 is stuck stale for 580.224688, current state stale+down,
>> last acting [16]
>> pg 14.901 is stuck inactive for 60266.037762, current state
>> activating+degraded, last acting [22,37]
>> pg 14.90f is stuck inactive for 60296.996447, current state
>> activating, last acting [30,36]
>> pg 14.910 is stuck inactive for 60266.077310, current state
>> activating+degraded, last acting [17,37]
>> pg 14.915 is stuck inactive for 60266.032445, current state
>> activating, last acting [34,36]
>> pg 15.8fa is stuck stale for 560.223249, current state stale+down,
>> last acting [8]
>> pg 15.90c is stuck inactive for 59294.402388, current state
>> activating, last acting [29,38]
>> pg 15.90d is stuck inactive for 60266.176492, current state
>> activating, last acting [5,36]
>> pg 15.915 is down, acting [0]
>> pg 15.917 is stuck inactive for 56279.658951, current state
>> activating, last acting [13,38]
>> pg 15.91c is stuck stale for 374.590704, current state stale+down,
>> last acting [12]
>> pg 16.903 is stuck inactive for 56580.905961, current state
>> activating, last acting [25,37]
>> pg 16.90e is stuck inactive for 60266.271680, current state
>> activating, last acting [14,37]
>> pg 16.919 is stuck inactive for 59901.802184, current state
>> activating, last acting [20,37]
>> pg 16.91e is stuck inactive for 60297.038159, 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Caspar,

The number of IOPS is also quite low. It used to be around 1K+ on one of the
pools (VMs); now it is closer to 10-30.

Thanks
Arun

On Fri, Jan 4, 2019 at 5:41 AM Arun POONIA 
wrote:

> Hi Caspar,
>
> Yes and No, numbers are going up and down. If I run ceph -s command I can
> see it decreases one time and later it increases again. I see there are so
> many blocked/slow requests. Almost all the OSDs have slow requests. Around
> 12% PGs are inactive not sure how to activate them again.
>
>
> [root@fre101 ~]# ceph health detail
> 2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0)
> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
> bind the UNIX domain socket to
> '/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2)
> No such file or directory
> HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than
> average; 472812/12392654 objects misplaced (3.815%); 3610 PGs pending on
> creation; Reduced data availability: 6578 pgs inactive, 1882 pgs down, 86
> pgs peering, 850 pgs stale; Degraded data redundancy: 216694/12392654
> objects degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow
> requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec;
> too many PGs per OSD (2709 > max 200)
> OSD_DOWN 1 osds down
> osd.28 (root=default,host=fre119) is down
> MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
> pool glance-images objects per pg (10478) is more than 92.7257 times
> cluster average (113)
> pool vms objects per pg (4717) is more than 41.7434 times cluster
> average (113)
> pool volumes objects per pg (1220) is more than 10.7965 times cluster
> average (113)
> OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%)
> PENDING_CREATING_PGS 3610 PGs pending on creation
> osds
> [osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9]
> have pending PGs.
> PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs
> down, 86 pgs peering, 850 pgs stale
> pg 10.900 is down, acting [18]
> pg 10.90e is stuck inactive for 60266.030164, current state
> activating, last acting [2,38]
> pg 10.913 is stuck stale for 1887.552862, current state stale+down,
> last acting [9]
> pg 10.915 is stuck inactive for 60266.215231, current state
> activating, last acting [30,38]
> pg 11.903 is stuck inactive for 59294.465961, current state
> activating, last acting [11,38]
> pg 11.910 is down, acting [21]
> pg 11.919 is down, acting [25]
> pg 12.902 is stuck inactive for 57118.544590, current state
> activating, last acting [36,14]
> pg 13.8f8 is stuck inactive for 60707.167787, current state
> activating, last acting [29,37]
> pg 13.901 is stuck stale for 60226.543289, current state
> stale+active+clean, last acting [1,31]
> pg 13.905 is stuck inactive for 60266.050940, current state
> activating, last acting [2,36]
> pg 13.909 is stuck inactive for 60707.160714, current state
> activating, last acting [34,36]
> pg 13.90e is stuck inactive for 60707.410749, current state
> activating, last acting [21,36]
> pg 13.911 is down, acting [25]
> pg 13.914 is stale+down, acting [29]
> pg 13.917 is stuck stale for 580.224688, current state stale+down,
> last acting [16]
> pg 14.901 is stuck inactive for 60266.037762, current state
> activating+degraded, last acting [22,37]
> pg 14.90f is stuck inactive for 60296.996447, current state
> activating, last acting [30,36]
> pg 14.910 is stuck inactive for 60266.077310, current state
> activating+degraded, last acting [17,37]
> pg 14.915 is stuck inactive for 60266.032445, current state
> activating, last acting [34,36]
> pg 15.8fa is stuck stale for 560.223249, current state stale+down,
> last acting [8]
> pg 15.90c is stuck inactive for 59294.402388, current state
> activating, last acting [29,38]
> pg 15.90d is stuck inactive for 60266.176492, current state
> activating, last acting [5,36]
> pg 15.915 is down, acting [0]
> pg 15.917 is stuck inactive for 56279.658951, current state
> activating, last acting [13,38]
> pg 15.91c is stuck stale for 374.590704, current state stale+down,
> last acting [12]
> pg 16.903 is stuck inactive for 56580.905961, current state
> activating, last acting [25,37]
> pg 16.90e is stuck inactive for 60266.271680, current state
> activating, last acting [14,37]
> pg 16.919 is stuck inactive for 59901.802184, current state
> activating, last acting [20,37]
> pg 16.91e is stuck inactive for 60297.038159, current state
> activating, last acting [22,37]
> pg 17.8e5 is stuck inactive for 60266.149061, current state
> activating, last acting [25,36]
> pg 17.910 is stuck inactive for 59901.850204, current state
> activating, last acting 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Caspar,

Yes and no, the numbers are going up and down. If I run the ceph -s command I
can see them decrease one time and increase again later. I see there are many
blocked/slow requests; almost all the OSDs have slow requests. Around 12% of
the PGs are inactive and I am not sure how to activate them again.


[root@fre101 ~]# ceph health detail
2019-01-04 05:39:23.860142 7fc37a3a0700 -1 asok(0x7fc3740017a0)
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
bind the UNIX domain socket to
'/var/run/ceph-guests/ceph-client.admin.1066526.140477441513808.asok': (2)
No such file or directory
HEALTH_ERR 1 osds down; 3 pools have many more objects per pg than average;
472812/12392654 objects misplaced (3.815%); 3610 PGs pending on creation;
Reduced data availability: 6578 pgs inactive, 1882 pgs down, 86 pgs
peering, 850 pgs stale; Degraded data redundancy: 216694/12392654 objects
degraded (1.749%), 866 pgs degraded, 16 pgs undersized; 116082 slow
requests are blocked > 32 sec; 551 stuck requests are blocked > 4096 sec;
too many PGs per OSD (2709 > max 200)
OSD_DOWN 1 osds down
osd.28 (root=default,host=fre119) is down
MANY_OBJECTS_PER_PG 3 pools have many more objects per pg than average
pool glance-images objects per pg (10478) is more than 92.7257 times
cluster average (113)
pool vms objects per pg (4717) is more than 41.7434 times cluster
average (113)
pool volumes objects per pg (1220) is more than 10.7965 times cluster
average (113)
OBJECT_MISPLACED 472812/12392654 objects misplaced (3.815%)
PENDING_CREATING_PGS 3610 PGs pending on creation
osds
[osd.0,osd.1,osd.10,osd.11,osd.14,osd.15,osd.17,osd.18,osd.19,osd.20,osd.21,osd.22,osd.23,osd.25,osd.26,osd.27,osd.28,osd.3,osd.30,osd.32,osd.33,osd.35,osd.36,osd.37,osd.38,osd.4,osd.5,osd.6,osd.7,osd.9]
have pending PGs.
PG_AVAILABILITY Reduced data availability: 6578 pgs inactive, 1882 pgs
down, 86 pgs peering, 850 pgs stale
pg 10.900 is down, acting [18]
pg 10.90e is stuck inactive for 60266.030164, current state activating,
last acting [2,38]
pg 10.913 is stuck stale for 1887.552862, current state stale+down,
last acting [9]
pg 10.915 is stuck inactive for 60266.215231, current state activating,
last acting [30,38]
pg 11.903 is stuck inactive for 59294.465961, current state activating,
last acting [11,38]
pg 11.910 is down, acting [21]
pg 11.919 is down, acting [25]
pg 12.902 is stuck inactive for 57118.544590, current state activating,
last acting [36,14]
pg 13.8f8 is stuck inactive for 60707.167787, current state activating,
last acting [29,37]
pg 13.901 is stuck stale for 60226.543289, current state
stale+active+clean, last acting [1,31]
pg 13.905 is stuck inactive for 60266.050940, current state activating,
last acting [2,36]
pg 13.909 is stuck inactive for 60707.160714, current state activating,
last acting [34,36]
pg 13.90e is stuck inactive for 60707.410749, current state activating,
last acting [21,36]
pg 13.911 is down, acting [25]
pg 13.914 is stale+down, acting [29]
pg 13.917 is stuck stale for 580.224688, current state stale+down, last
acting [16]
pg 14.901 is stuck inactive for 60266.037762, current state
activating+degraded, last acting [22,37]
pg 14.90f is stuck inactive for 60296.996447, current state activating,
last acting [30,36]
pg 14.910 is stuck inactive for 60266.077310, current state
activating+degraded, last acting [17,37]
pg 14.915 is stuck inactive for 60266.032445, current state activating,
last acting [34,36]
pg 15.8fa is stuck stale for 560.223249, current state stale+down, last
acting [8]
pg 15.90c is stuck inactive for 59294.402388, current state activating,
last acting [29,38]
pg 15.90d is stuck inactive for 60266.176492, current state activating,
last acting [5,36]
pg 15.915 is down, acting [0]
pg 15.917 is stuck inactive for 56279.658951, current state activating,
last acting [13,38]
pg 15.91c is stuck stale for 374.590704, current state stale+down, last
acting [12]
pg 16.903 is stuck inactive for 56580.905961, current state activating,
last acting [25,37]
pg 16.90e is stuck inactive for 60266.271680, current state activating,
last acting [14,37]
pg 16.919 is stuck inactive for 59901.802184, current state activating,
last acting [20,37]
pg 16.91e is stuck inactive for 60297.038159, current state activating,
last acting [22,37]
pg 17.8e5 is stuck inactive for 60266.149061, current state activating,
last acting [25,36]
pg 17.910 is stuck inactive for 59901.850204, current state activating,
last acting [26,37]
pg 17.913 is stuck inactive for 60707.208364, current state activating,
last acting [13,36]
pg 17.91a is stuck inactive for 60266.187509, current state activating,
last acting [4,37]
pg 17.91f is down, acting [6]
pg 18.908 is stuck inactive for 60707.216314, current state activating,
last acting [10,36]
pg 18.911 is stuck stale for 244.570413, 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Caspar Smit
Are the numbers still decreasing?

This one for instance:

"3883 PGs pending on creation"

Caspar


On Fri, Jan 4, 2019 at 14:23 Arun POONIA <arun.poo...@nuagenetworks.net> wrote:

> Hi Caspar,
>
> Yes, cluster was working fine with number of PGs per OSD warning up until
> now. I am not sure how to recover from stale down/inactive PGs. If you
> happen to know about this can you let me know?
>
> Current State:
>
> [root@fre101 ~]# ceph -s
> 2019-01-04 05:22:05.942349 7f314f613700 -1 asok(0x7f31480017a0)
> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
> bind the UNIX domain socket to
> '/var/run/ceph-guests/ceph-client.admin.1053724.139849638091088.asok': (2)
> No such file or directory
>   cluster:
> id: adb9ad8e-f458-4124-bf58-7963a8d1391f
> health: HEALTH_ERR
> 3 pools have many more objects per pg than average
> 505714/12392650 objects misplaced (4.081%)
> 3883 PGs pending on creation
> Reduced data availability: 6519 pgs inactive, 1870 pgs down, 1
> pg peering, 886 pgs stale
> Degraded data redundancy: 42987/12392650 objects degraded
> (0.347%), 634 pgs degraded, 16 pgs undersized
> 125827 slow requests are blocked > 32 sec
> 2 stuck requests are blocked > 4096 sec
> too many PGs per OSD (2758 > max 200)
>
>   services:
> mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
> mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
> osd: 39 osds: 39 up, 39 in; 76 remapped pgs
> rgw: 1 daemon active
>
>   data:
> pools:   18 pools, 54656 pgs
> objects: 6051k objects, 10944 GB
> usage:   21933 GB used, 50688 GB / 72622 GB avail
> pgs: 11.927% pgs not active
>  42987/12392650 objects degraded (0.347%)
>  505714/12392650 objects misplaced (4.081%)
>  48080 active+clean
>  3885  activating
>    down
>  759   stale+down
>  614   activating+degraded
>  74activating+remapped
>  46stale+active+clean
>  35stale+activating
>  21stale+activating+remapped
>  9 stale+active+undersized
>  9 stale+activating+degraded
>  5 stale+activating+undersized+degraded+remapped
>  3 activating+degraded+remapped
>  1 stale+activating+degraded+remapped
>  1 stale+active+undersized+degraded
>  1 remapped+peering
>  1 active+clean+remapped
>  1 activating+undersized+degraded+remapped
>
>   io:
> client:   0 B/s rd, 25397 B/s wr, 4 op/s rd, 4 op/s wr
>
> I will update number of PGs per OSD once these inactive or stale PGs come
> online. I am not able to access VMs (VMs, Images) which are using Ceph.
>
> Thanks
> Arun
>
> On Fri, Jan 4, 2019 at 4:53 AM Caspar Smit  wrote:
>
>> Hi Arun,
>>
>> How did you end up with a 'working' cluster with so many pgs per OSD?
>>
>> "too many PGs per OSD (2968 > max 200)"
>>
>> To (temporarily) allow this kind of pgs per osd you could try this:
>>
>> Change these values in the global section in your ceph.conf:
>>
>> mon max pg per osd = 200
>> osd max pg per osd hard ratio = 2
>>
>> It allows 200*2 = 400 Pgs per OSD before disabling the creation of new
>> pgs.
>>
>> Above are the defaults (for Luminous, maybe other versions too)
>> You can check your current settings with:
>>
>> ceph daemon mon.ceph-mon01 config show |grep pg_per_osd
>>
>> Since your current pgs per osd ratio is way higher then the default you
>> could set them to for instance:
>>
>> mon max pg per osd = 1000
>> osd max pg per osd hard ratio = 5
>>
>> Which allow for 5000 pgs per osd before disabling creation of new pgs.
>>
>> You'll need to inject the setting into the mons/osds and restart mgrs to
>> make them active.
>>
>> ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'
>> ceph tell mon.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
>> ceph tell osd.* injectargs '--mon_max_pg_per_osd 1000'
>> ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
>> restart mgrs
>>
>> Kind regards,
>> Caspar
>>
>>
>> On Fri, Jan 4, 2019 at 04:28 Arun POONIA <arun.poo...@nuagenetworks.net> wrote:
>>
>>> Hi Chris,
>>>
>>> Indeed that's what happened. I didn't set noout flag either and I did
>>> zapped disk on new server every time. In my cluster status fre201 is only
>>> new server.
>>>
>>> Current Status after enabling 3 OSDs on fre201 host.
>>>
>>> [root@fre201 ~]# ceph osd tree
>>> ID  CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
>>>  -1   70.92137 root default
>>>  -25.45549 host fre101
>>>   0   hdd  1.81850 osd.0   up  1.0 1.0
>>>   1   hdd  1.81850 osd.1   up  1.0 1.0
>>>   2   hdd  1.81850 osd.2   up  1.0 1.0
>>>  -95.45549 host fre103
>>>   3   hdd  

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Arun POONIA
Hi Caspar,

Yes, the cluster was working fine with the "too many PGs per OSD" warning up
until now. I am not sure how to recover from stale/down/inactive PGs. If you
happen to know how, can you let me know?

Current State:

[root@fre101 ~]# ceph -s
2019-01-04 05:22:05.942349 7f314f613700 -1 asok(0x7f31480017a0)
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
bind the UNIX domain socket to
'/var/run/ceph-guests/ceph-client.admin.1053724.139849638091088.asok': (2)
No such file or directory
  cluster:
id: adb9ad8e-f458-4124-bf58-7963a8d1391f
health: HEALTH_ERR
3 pools have many more objects per pg than average
505714/12392650 objects misplaced (4.081%)
3883 PGs pending on creation
Reduced data availability: 6519 pgs inactive, 1870 pgs down, 1
pg peering, 886 pgs stale
Degraded data redundancy: 42987/12392650 objects degraded
(0.347%), 634 pgs degraded, 16 pgs undersized
125827 slow requests are blocked > 32 sec
2 stuck requests are blocked > 4096 sec
too many PGs per OSD (2758 > max 200)

  services:
mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
osd: 39 osds: 39 up, 39 in; 76 remapped pgs
rgw: 1 daemon active

  data:
pools:   18 pools, 54656 pgs
objects: 6051k objects, 10944 GB
usage:   21933 GB used, 50688 GB / 72622 GB avail
pgs: 11.927% pgs not active
 42987/12392650 objects degraded (0.347%)
 505714/12392650 objects misplaced (4.081%)
 48080 active+clean
 3885  activating
   down
 759   stale+down
 614   activating+degraded
 74activating+remapped
 46stale+active+clean
 35stale+activating
 21stale+activating+remapped
 9 stale+active+undersized
 9 stale+activating+degraded
 5 stale+activating+undersized+degraded+remapped
 3 activating+degraded+remapped
 1 stale+activating+degraded+remapped
 1 stale+active+undersized+degraded
 1 remapped+peering
 1 active+clean+remapped
 1 activating+undersized+degraded+remapped

  io:
client:   0 B/s rd, 25397 B/s wr, 4 op/s rd, 4 op/s wr

I will update the number of PGs per OSD once these inactive or stale PGs come
back online. I am not able to access the VMs (VMs, Images) which use Ceph.

Thanks
Arun

On Fri, Jan 4, 2019 at 4:53 AM Caspar Smit  wrote:

> Hi Arun,
>
> How did you end up with a 'working' cluster with so many pgs per OSD?
>
> "too many PGs per OSD (2968 > max 200)"
>
> To (temporarily) allow this kind of pgs per osd you could try this:
>
> Change these values in the global section in your ceph.conf:
>
> mon max pg per osd = 200
> osd max pg per osd hard ratio = 2
>
> It allows 200*2 = 400 Pgs per OSD before disabling the creation of new
> pgs.
>
> Above are the defaults (for Luminous, maybe other versions too)
> You can check your current settings with:
>
> ceph daemon mon.ceph-mon01 config show |grep pg_per_osd
>
> Since your current PGs-per-OSD ratio is way higher than the default, you
> could set them to, for instance:
>
> mon max pg per osd = 1000
> osd max pg per osd hard ratio = 5
>
> Which allows for 5000 PGs per OSD before disabling the creation of new pgs.
>
> You'll need to inject the settings into the mons/OSDs and restart the mgrs
> to make them active.
>
> ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'
> ceph tell mon.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
> ceph tell osd.* injectargs '--mon_max_pg_per_osd 1000'
> ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
> restart mgrs
>
> Kind regards,
> Caspar
>
>
> Op vr 4 jan. 2019 om 04:28 schreef Arun POONIA <
> arun.poo...@nuagenetworks.net>:
>
>> Hi Chris,
>>
>> Indeed that's what happened. I didn't set the noout flag either, and I
>> zapped the disks on the new server every time. In my cluster status, fre201
>> is the only new server.
>>
>> Current Status after enabling 3 OSDs on fre201 host.
>>
>> [root@fre201 ~]# ceph osd tree
>> ID  CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
>>  -1   70.92137 root default
>>  -25.45549 host fre101
>>   0   hdd  1.81850 osd.0   up  1.0 1.0
>>   1   hdd  1.81850 osd.1   up  1.0 1.0
>>   2   hdd  1.81850 osd.2   up  1.0 1.0
>>  -95.45549 host fre103
>>   3   hdd  1.81850 osd.3   up  1.0 1.0
>>   4   hdd  1.81850 osd.4   up  1.0 1.0
>>   5   hdd  1.81850 osd.5   up  1.0 1.0
>>  -35.45549 host fre105
>>   6   hdd  1.81850 osd.6   up  1.0 1.0
>>   7   hdd  1.81850 osd.7   up  1.0 1.0
>>   8   hdd  1.81850 osd.8   up  

Re: [ceph-users] Help Ceph Cluster Down

2019-01-04 Thread Caspar Smit
Hi Arun,

How did you end up with a 'working' cluster with so many PGs per OSD?

"too many PGs per OSD (2968 > max 200)"

To (temporarily) allow this many PGs per OSD you could try this:

Change these values in the global section in your ceph.conf:

mon max pg per osd = 200
osd max pg per osd hard ratio = 2

It allows 200*2 = 400 PGs per OSD before disabling the creation of new
PGs.

Above are the defaults (for Luminous, maybe other versions too)
You can check your current settings with:

ceph daemon mon.ceph-mon01 config show |grep pg_per_osd

Since your current PGs-per-OSD ratio is way higher than the default, you
could set them to, for instance:

mon max pg per osd = 1000
osd max pg per osd hard ratio = 5

Which allows for 5000 PGs per OSD before disabling the creation of new PGs.
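
If you want the change to survive daemon restarts, the same values can also
go into the [global] section of ceph.conf on the mon/OSD hosts, roughly like
this (a sketch, using the values above):

[global]
mon max pg per osd = 1000
osd max pg per osd hard ratio = 5

Editing ceph.conf only takes effect when the daemons restart, hence the
injectargs step below.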

You'll need to inject the settings into the mons/OSDs and restart the mgrs to
make them active.

ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'
ceph tell mon.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
ceph tell osd.* injectargs '--mon_max_pg_per_osd 1000'
ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'
restart mgrs
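
For example, assuming the usual ceph-mgr@<name> systemd units and that the
mgrs run on the mon hosts (as your ceph -s output suggests), restarting the
mgrs and double-checking the new values could look something like:

systemctl restart ceph-mgr@ceph-mon01   # repeat on ceph-mon02 / ceph-mon03
ceph daemon mon.ceph-mon01 config show | grep pg_per_osd
ceph daemon osd.0 config show | grep pg_per_osd   # run on the host carrying that OSD

Both greps should then report the injected values (1000 and 5).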

Kind regards,
Caspar


Op vr 4 jan. 2019 om 04:28 schreef Arun POONIA <
arun.poo...@nuagenetworks.net>:

> Hi Chris,
>
> Indeed that's what happened. I didn't set the noout flag either, and I
> zapped the disks on the new server every time. In my cluster status, fre201
> is the only new server.
>
> Current Status after enabling 3 OSDs on fre201 host.
>
> [root@fre201 ~]# ceph osd tree
> ID  CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
>  -1   70.92137 root default
>  -25.45549 host fre101
>   0   hdd  1.81850 osd.0   up  1.0 1.0
>   1   hdd  1.81850 osd.1   up  1.0 1.0
>   2   hdd  1.81850 osd.2   up  1.0 1.0
>  -95.45549 host fre103
>   3   hdd  1.81850 osd.3   up  1.0 1.0
>   4   hdd  1.81850 osd.4   up  1.0 1.0
>   5   hdd  1.81850 osd.5   up  1.0 1.0
>  -35.45549 host fre105
>   6   hdd  1.81850 osd.6   up  1.0 1.0
>   7   hdd  1.81850 osd.7   up  1.0 1.0
>   8   hdd  1.81850 osd.8   up  1.0 1.0
>  -45.45549 host fre107
>   9   hdd  1.81850 osd.9   up  1.0 1.0
>  10   hdd  1.81850 osd.10  up  1.0 1.0
>  11   hdd  1.81850 osd.11  up  1.0 1.0
>  -55.45549 host fre109
>  12   hdd  1.81850 osd.12  up  1.0 1.0
>  13   hdd  1.81850 osd.13  up  1.0 1.0
>  14   hdd  1.81850 osd.14  up  1.0 1.0
>  -65.45549 host fre111
>  15   hdd  1.81850 osd.15  up  1.0 1.0
>  16   hdd  1.81850 osd.16  up  1.0 1.0
>  17   hdd  1.81850 osd.17  up  0.7 1.0
>  -75.45549 host fre113
>  18   hdd  1.81850 osd.18  up  1.0 1.0
>  19   hdd  1.81850 osd.19  up  1.0 1.0
>  20   hdd  1.81850 osd.20  up  1.0 1.0
>  -85.45549 host fre115
>  21   hdd  1.81850 osd.21  up  1.0 1.0
>  22   hdd  1.81850 osd.22  up  1.0 1.0
>  23   hdd  1.81850 osd.23  up  1.0 1.0
> -105.45549 host fre117
>  24   hdd  1.81850 osd.24  up  1.0 1.0
>  25   hdd  1.81850 osd.25  up  1.0 1.0
>  26   hdd  1.81850 osd.26  up  1.0 1.0
> -115.45549 host fre119
>  27   hdd  1.81850 osd.27  up  1.0 1.0
>  28   hdd  1.81850 osd.28  up  1.0 1.0
>  29   hdd  1.81850 osd.29  up  1.0 1.0
> -125.45549 host fre121
>  30   hdd  1.81850 osd.30  up  1.0 1.0
>  31   hdd  1.81850 osd.31  up  1.0 1.0
>  32   hdd  1.81850 osd.32  up  1.0 1.0
> -135.45549 host fre123
>  33   hdd  1.81850 osd.33  up  1.0 1.0
>  34   hdd  1.81850 osd.34  up  1.0 1.0
>  35   hdd  1.81850 osd.35  up  1.0 1.0
> -275.45549 host fre201
>  36   hdd  1.81850 osd.36  up  1.0 1.0
>  37   hdd  1.81850 osd.37  up  1.0 1.0
>  38   hdd  1.81850 osd.38  up  1.0 1.0
> [root@fre201 ~]#
> [root@fre201 ~]#
> [root@fre201 ~]#
> [root@fre201 ~]#
> [root@fre201 ~]#
> [root@fre201 ~]# ceph -s
>   cluster:
> id: adb9ad8e-f458-4124-bf58-7963a8d1391f
> health: HEALTH_ERR
> 3 pools have many more objects per pg than average
> 585791/12391450 objects misplaced (4.727%)
> 2 scrub errors
> 2374 PGs pending on creation
> Reduced data availability: 6578 pgs inactive, 2025 pgs down,
> 74 pgs peering, 1234 pgs stale
> Possible 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-03 Thread Arun POONIA
Hi Chris,

Indeed that's what happened. I didn't set the noout flag either, and I
zapped the disks on the new server every time. In my cluster status, fre201
is the only new server.

Current Status after enabling 3 OSDs on fre201 host.

[root@fre201 ~]# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
 -1   70.92137 root default
 -25.45549 host fre101
  0   hdd  1.81850 osd.0   up  1.0 1.0
  1   hdd  1.81850 osd.1   up  1.0 1.0
  2   hdd  1.81850 osd.2   up  1.0 1.0
 -95.45549 host fre103
  3   hdd  1.81850 osd.3   up  1.0 1.0
  4   hdd  1.81850 osd.4   up  1.0 1.0
  5   hdd  1.81850 osd.5   up  1.0 1.0
 -35.45549 host fre105
  6   hdd  1.81850 osd.6   up  1.0 1.0
  7   hdd  1.81850 osd.7   up  1.0 1.0
  8   hdd  1.81850 osd.8   up  1.0 1.0
 -45.45549 host fre107
  9   hdd  1.81850 osd.9   up  1.0 1.0
 10   hdd  1.81850 osd.10  up  1.0 1.0
 11   hdd  1.81850 osd.11  up  1.0 1.0
 -55.45549 host fre109
 12   hdd  1.81850 osd.12  up  1.0 1.0
 13   hdd  1.81850 osd.13  up  1.0 1.0
 14   hdd  1.81850 osd.14  up  1.0 1.0
 -65.45549 host fre111
 15   hdd  1.81850 osd.15  up  1.0 1.0
 16   hdd  1.81850 osd.16  up  1.0 1.0
 17   hdd  1.81850 osd.17  up  0.7 1.0
 -75.45549 host fre113
 18   hdd  1.81850 osd.18  up  1.0 1.0
 19   hdd  1.81850 osd.19  up  1.0 1.0
 20   hdd  1.81850 osd.20  up  1.0 1.0
 -85.45549 host fre115
 21   hdd  1.81850 osd.21  up  1.0 1.0
 22   hdd  1.81850 osd.22  up  1.0 1.0
 23   hdd  1.81850 osd.23  up  1.0 1.0
-105.45549 host fre117
 24   hdd  1.81850 osd.24  up  1.0 1.0
 25   hdd  1.81850 osd.25  up  1.0 1.0
 26   hdd  1.81850 osd.26  up  1.0 1.0
-115.45549 host fre119
 27   hdd  1.81850 osd.27  up  1.0 1.0
 28   hdd  1.81850 osd.28  up  1.0 1.0
 29   hdd  1.81850 osd.29  up  1.0 1.0
-125.45549 host fre121
 30   hdd  1.81850 osd.30  up  1.0 1.0
 31   hdd  1.81850 osd.31  up  1.0 1.0
 32   hdd  1.81850 osd.32  up  1.0 1.0
-135.45549 host fre123
 33   hdd  1.81850 osd.33  up  1.0 1.0
 34   hdd  1.81850 osd.34  up  1.0 1.0
 35   hdd  1.81850 osd.35  up  1.0 1.0
-275.45549 host fre201
 36   hdd  1.81850 osd.36  up  1.0 1.0
 37   hdd  1.81850 osd.37  up  1.0 1.0
 38   hdd  1.81850 osd.38  up  1.0 1.0
[root@fre201 ~]#
[root@fre201 ~]#
[root@fre201 ~]#
[root@fre201 ~]#
[root@fre201 ~]#
[root@fre201 ~]# ceph -s
  cluster:
id: adb9ad8e-f458-4124-bf58-7963a8d1391f
health: HEALTH_ERR
3 pools have many more objects per pg than average
585791/12391450 objects misplaced (4.727%)
2 scrub errors
2374 PGs pending on creation
Reduced data availability: 6578 pgs inactive, 2025 pgs down, 74
pgs peering, 1234 pgs stale
Possible data damage: 2 pgs inconsistent
Degraded data redundancy: 64969/12391450 objects degraded
(0.524%), 616 pgs degraded, 20 pgs undersized
96242 slow requests are blocked > 32 sec
228 stuck requests are blocked > 4096 sec
too many PGs per OSD (2768 > max 200)

  services:
mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
osd: 39 osds: 39 up, 39 in; 96 remapped pgs
rgw: 1 daemon active

  data:
pools:   18 pools, 54656 pgs
objects: 6050k objects, 10942 GB
usage:   21900 GB used, 50721 GB / 72622 GB avail
pgs: 0.002% pgs unknown
 12.050% pgs not active
 64969/12391450 objects degraded (0.524%)
 585791/12391450 objects misplaced (4.727%)
 47489 active+clean
 3670  activating
 1098  stale+down
 923   down
 575   activating+degraded
 563   stale+active+clean
 105   stale+activating
 78activating+remapped
 72peering
 25stale+activating+degraded
 23stale+activating+remapped
 9 stale+active+undersized
 6 stale+activating+undersized+degraded+remapped
 5 stale+active+undersized+degraded
 4 

Re: [ceph-users] Help Ceph Cluster Down

2019-01-03 Thread Chris
If you added OSDs and then deleted them repeatedly without waiting for
replication to finish as the cluster attempted to re-balance across them,
it's highly likely that you are permanently missing PGs (especially if the
disks were zapped each time).
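
For next time, a safer pattern is to set the noout flag for the maintenance
window and let recovery finish before destroying anything. A rough sketch
(osd.36 is just an example ID; ceph osd purge is the Luminous shortcut for
crush remove + auth del + osd rm):

ceph osd set noout          # OSDs that go down are not auto-marked out during maintenance
# ...do the reinstall/zap work, bring the OSDs back up...
ceph osd unset noout

ceph osd out osd.36         # when an OSD really has to go, drain it first
ceph -s                     # wait until recovery finishes and nothing is degraded/misplaced
ceph osd purge osd.36 --yes-i-really-mean-it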


If those 3 down OSDs can be revived there is a (small) chance that you can
right the ship, but 1400 PGs/OSD is pretty extreme. I'm surprised the
cluster even let you do that - this sounds like a data loss event.



Bring back the 3 OSDs and see what those 2 inconsistent PGs look like with
ceph pg query.
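
A minimal sketch of that inspection (2.1ab is a made-up placeholder PG ID;
the real ones come out of ceph health detail):

ceph health detail | grep inconsistent                    # find the affected PG IDs
ceph pg 2.1ab query                                       # check the acting set and recovery state
rados list-inconsistent-obj 2.1ab --format=json-pretty    # see which replicas disagree
ceph pg repair 2.1ab                                      # only once you know which copy is bad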


On January 3, 2019 21:59:38 Arun POONIA  wrote:

Hi,

Recently I tried adding a new node (OSD) to the Ceph cluster using the
ceph-deploy tool. I was experimenting with the tool and ended up deleting the
OSDs on the new server a couple of times.


Now, even though the Ceph OSDs are running on the new server, the cluster's
PGs seem to be inactive (10-15%) and they are not recovering or rebalancing.
Not sure what to do. I tried shutting down the OSDs on the new server.


Status:
[root@fre105 ~]# ceph -s
2019-01-03 18:56:42.867081 7fa0bf573700 -1 asok(0x7fa0b80017a0) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
bind the UNIX domain socket to 
'/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2) 
No such file or directory

 cluster:
   id: adb9ad8e-f458-4124-bf58-7963a8d1391f
   health: HEALTH_ERR
   3 pools have many more objects per pg than average
   373907/12391198 objects misplaced (3.018%)
   2 scrub errors
   9677 PGs pending on creation
   Reduced data availability: 7145 pgs inactive, 6228 pgs down, 1 pg peering, 
   2717 pgs stale

   Possible data damage: 2 pgs inconsistent
   Degraded data redundancy: 178350/12391198 objects degraded (1.439%), 346 
   pgs degraded, 1297 pgs undersized

   52486 slow requests are blocked > 32 sec
   9287 stuck requests are blocked > 4096 sec
   too many PGs per OSD (2968 > max 200)

 services:
   mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
   mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
   osd: 39 osds: 36 up, 36 in; 51 remapped pgs
   rgw: 1 daemon active

 data:
   pools:   18 pools, 54656 pgs
   objects: 6050k objects, 10941 GB
   usage:   21727 GB used, 45308 GB / 67035 GB avail
   pgs: 13.073% pgs not active
178350/12391198 objects degraded (1.439%)
373907/12391198 objects misplaced (3.018%)
46177 active+clean
5054  down
1173  stale+down
1084  stale+active+undersized
547   activating
201   stale+active+undersized+degraded
158   stale+activating
96activating+degraded
46stale+active+clean
42activating+remapped
34stale+activating+degraded
23stale+activating+remapped
6 stale+activating+undersized+degraded+remapped
6 activating+undersized+degraded+remapped
2 activating+degraded+remapped
2 active+clean+inconsistent
1 stale+activating+degraded+remapped
1 stale+active+clean+remapped
1 stale+remapped
1 down+remapped
1 remapped+peering

 io:
   client:   0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr

Thanks
--
Arun Poonia

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help Ceph Cluster Down

2019-01-03 Thread Arun POONIA
Hi,

Recently I tried adding a new node (OSD) to the Ceph cluster using the
ceph-deploy tool. I was experimenting with the tool and ended up deleting the
OSDs on the new server a couple of times.

Now, even though the Ceph OSDs are running on the new server, the cluster's
PGs seem to be inactive (10-15%) and they are not recovering or rebalancing.
Not sure what to do. I tried shutting down the OSDs on the new server.

Status:
[root@fre105 ~]# ceph -s
2019-01-03 18:56:42.867081 7fa0bf573700 -1 asok(0x7fa0b80017a0)
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
bind the UNIX domain socket to
'/var/run/ceph-guests/ceph-client.admin.4018644.140328258509136.asok': (2)
No such file or directory
  cluster:
id: adb9ad8e-f458-4124-bf58-7963a8d1391f
health: HEALTH_ERR
3 pools have many more objects per pg than average
373907/12391198 objects misplaced (3.018%)
2 scrub errors
9677 PGs pending on creation
Reduced data availability: 7145 pgs inactive, 6228 pgs down, 1
pg peering, 2717 pgs stale
Possible data damage: 2 pgs inconsistent
Degraded data redundancy: 178350/12391198 objects degraded
(1.439%), 346 pgs degraded, 1297 pgs undersized
52486 slow requests are blocked > 32 sec
9287 stuck requests are blocked > 4096 sec
too many PGs per OSD (2968 > max 200)

  services:
mon: 3 daemons, quorum ceph-mon01,ceph-mon02,ceph-mon03
mgr: ceph-mon03(active), standbys: ceph-mon01, ceph-mon02
osd: 39 osds: 36 up, 36 in; 51 remapped pgs
rgw: 1 daemon active

  data:
pools:   18 pools, 54656 pgs
objects: 6050k objects, 10941 GB
usage:   21727 GB used, 45308 GB / 67035 GB avail
pgs: 13.073% pgs not active
 178350/12391198 objects degraded (1.439%)
 373907/12391198 objects misplaced (3.018%)
 46177 active+clean
 5054  down
 1173  stale+down
 1084  stale+active+undersized
 547   activating
 201   stale+active+undersized+degraded
 158   stale+activating
 96activating+degraded
 46stale+active+clean
 42activating+remapped
 34stale+activating+degraded
 23stale+activating+remapped
 6 stale+activating+undersized+degraded+remapped
 6 activating+undersized+degraded+remapped
 2 activating+degraded+remapped
 2 active+clean+inconsistent
 1 stale+activating+degraded+remapped
 1 stale+active+clean+remapped
 1 stale+remapped
 1 down+remapped
 1 remapped+peering

  io:
client:   0 B/s rd, 208 kB/s wr, 28 op/s rd, 28 op/s wr

Thanks
-- 
Arun Poonia
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com