Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-27 Thread Chris
When your node went down, the cluster lost every copy (or erasure-coded
shard) that was stored on that node, so it had to re-create all of them on
the remaining nodes. When the node came back online (and particularly since
your usage was near zero), the cluster discovered that most objects on it
had not changed and were still identical to their counterparts. The only
objects that had to move were the ones that had changed and the ones that
needed to be relocated to satisfy the distribution requirements of your
CRUSH map.
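
To illustrate why so little has to move once the host is back, here is a toy sketch in Python. It does not use CRUSH; it substitutes plain rendezvous hashing as a stand-in, and the host names, the copy count and the set of "written during the outage" objects are all made up for the example. The point it tries to show is only that placement is a deterministic function of the set of available hosts, so restoring the host restores the old placement.

```python
import hashlib

HOSTS = [f"node-{i}" for i in range(10)]   # hypothetical host names
FAILED = "node-3"                          # hypothetical failed host
COPIES = 3                                 # assumed number of copies per object


def score(obj, host):
    """Deterministic pseudo-random weight of a host for a given object."""
    digest = hashlib.sha256(f"{obj}:{host}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(obj, hosts):
    """Pick the COPIES highest-scoring hosts (rendezvous hashing, not CRUSH)."""
    return sorted(hosts, key=lambda h: score(obj, h), reverse=True)[:COPIES]


objects = [f"obj-{n}" for n in range(1000)]

# Placement with all hosts, with the failed host removed, and after it returns.
before = {o: place(o, HOSTS) for o in objects}
during = {o: place(o, [h for h in HOSTS if h != FAILED]) for o in objects}
after = {o: place(o, HOSTS) for o in objects}

# Objects whose placement changed while the host was out.
remapped = {o for o in objects if before[o] != during[o]}

# Made-up sample of objects written during the outage: the returning host
# only needs fresh data for those of them it is responsible for.
written_during_outage = set(objects[:20])
stale_on_return = {o for o in written_during_outage if FAILED in after[o]}

print(f"remapped during outage: {len(remapped)} of {len(objects)}")
print(f"placement identical after return: {before == after}")
print(f"stale on returning host: {len(stale_on_return)}")
```

In this model only the objects that had a copy on the failed host are remapped during the outage, and the mapping after the host returns is the same as before. A real cluster additionally has to clean up the temporary copies and catch up on writes, which this sketch only hints at.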



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-27 Thread Götz Reinicke
Dear all,

thanks for your feedback. I'll try to take every suggestion into
consideration.

I've rebooted the node in question and all 24 OSDs came online without any
complaints.

But what makes me wonder is this: during the downtime the objects got
rebalanced and placed on the remaining nodes.

With the failed node back online, only a couple of hundred objects were
misplaced, out of about 35 million.

The question for me is: what happens to the objects on the OSDs that went
down after the OSDs come back online?

Thanks for your feedback.



     
Götz Reinicke
IT Coordinator
IT-OfficeNet
+49 7141 969 82420
goetz.reini...@filmakademie.de

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
http://www.filmakademie.de

Registered at Amtsgericht Stuttgart, HRB 205016
Chair of the Supervisory Board:
Petra Olschowski
State Secretary in the Ministry of Science,
Research and the Arts Baden-Württemberg
Managing Director:
Prof. Thomas Schadt



Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-26 Thread Christian Balzer

Hello,

this is where (depending on your topology) something like:
---
mon_osd_down_out_subtree_limit = host
---
can come in very handy.

Provided you have proper monitoring, alerting and operations in place, a
down node can often be restored long before any recovery would have
finished, and you also avoid the data movement back and forth.
And if you see that restoring the node will take a long time, just
manually mark its OSDs out for the time being.
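
As a rough illustration of what the setting above changes, the following Python sketch models the decision as I understand it: with the limit set to host, OSDs on a host that is completely down are not marked out automatically, while individual failed OSDs on an otherwise healthy host still are. The function name and the cluster layout are hypothetical, and the model ignores everything else that plays a role in a real cluster (the down-out interval, the noout flag and so on).

```python
def auto_out_candidates(down_osds, osds_by_host):
    """Return the down OSDs that would still be marked out automatically,
    assuming the subtree limit is 'host' (simplified model)."""
    candidates = set()
    for host, osds in osds_by_host.items():
        down_here = down_osds & osds
        # A completely down host is left alone; partial failures are not.
        if down_here and down_here != osds:
            candidates |= down_here
    return candidates


# Hypothetical layout: 10 hosts with 24 OSDs each, as in this thread.
osds_by_host = {
    f"node-{h}": {h * 24 + i for i in range(24)} for h in range(10)
}

whole_host_down = set(osds_by_host["node-3"])  # all 24 OSDs of one host
single_disk_down = {5}                         # one OSD on node-0

print(auto_out_candidates(whole_host_down, osds_by_host))   # set()
print(auto_out_candidates(single_disk_down, osds_by_host))  # {5}
```

In the whole-host case nothing is marked out, so no rebalancing starts until an operator decides to mark the OSDs out by hand.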

Christian



-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications


Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-26 Thread Brian Topping
I went through this last weekend when I reformatted all the OSDs on a much
smaller cluster. When I turned nodes back on, PGs would sometimes move, only
to move back, prolonging the operation and the stress on the system.

What I took away is that overall system stress is lowest when the OSD tree is
back in its target state as quickly as is safe and practical. Replication will
happen as replication will, but if the strategy changes midway, it just means
the same speed of movement over a longer time.



Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-26 Thread Götz Reinicke
Dear Chris,

Thanks for your feedback. The node/OSDs in question are part of an
erasure-coded pool, and during the weekend the workload should be close to
none.

Anyway, I was able to take a look at the console and at the server: the power
is on, but I can't use any console. The login prompt is shown, but no
keystroke is accepted.

I'll have to reboot the server and check what it is complaining about
tomorrow morning, as soon as I can access the server again.

Fingers crossed and regards. Götz





Re: [ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-26 Thread Chris
It sort of depends on your workload/use case. Recovery operations can be
computationally expensive. If your load is light because it's the weekend,
you should be able to turn that host back on as soon as you resolve
whatever the issue is, with minimal impact. You can also increase the
priority of the recovery operation to make it go faster if you feel you can
spare the additional I/O and it won't affect clients.

We do this in our cluster regularly and have yet to see an issue (given
that we take care to do it during periods of lower client I/O).
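
The message above does not say which settings are meant, so the following is only one possible reading: a small Python helper that builds (but does not run) a `ceph tell ... injectargs` command raising `osd_max_backfills` and `osd_recovery_max_active` on all OSDs. The helper name and the values 2 and 4 are arbitrary examples, not recommendations; suitable values depend on the hardware and on the client load.

```python
import shlex


def recovery_tuning_cmd(max_backfills, recovery_max_active):
    """Build (but do not run) a 'ceph tell' command that raises recovery limits."""
    if max_backfills < 1 or recovery_max_active < 1:
        raise ValueError("limits must be positive integers")
    args = (
        f"--osd-max-backfills {max_backfills} "
        f"--osd-recovery-max-active {recovery_max_active}"
    )
    # "osd.*" addresses every OSD; injectargs changes the running daemons only.
    return ["ceph", "tell", "osd.*", "injectargs", args]


# Example values only; print the command so it can be reviewed before use.
cmd = recovery_tuning_cmd(max_backfills=2, recovery_max_active=4)
print(shlex.join(cmd))
```

Because the values are injected at runtime, they would need to be set back the same way once recovery has finished and client I/O picks up again.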




[ceph-users] One host with 24 OSDs is offline - best way to get it back online

2019-01-26 Thread Götz Reinicke
Hi,

one host out of 10 is down for as yet unknown reasons; I suspect a power
failure. I have not been able to look at the server yet.

The cluster is recovering and remapping fine, but still has some objects to
process.

My question: may I just switch the server back on so that, in the best case,
the 24 OSDs come back online and recovery does its job without problems?

Or what might be a good way to handle that host? Should I first wait until the
recovery has finished?

Thanks for feedback and suggestions - Happy Saturday Night :)
Regards, Götz
