Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth

2019-07-24 Thread Christian Balzer
On Thu, 25 Jul 2019 13:49:22 +0900 Sangwhan Moon wrote:

> osd: 39 osds: 39 up, 38 in

You might want to track down that 'out' OSD (39 up, but only 38 in).

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Mobile Inc.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth

2019-07-24 Thread Sangwhan Moon
Hello,

Original Message:
> 
> 
> On 7/25/19 6:49 AM, Sangwhan Moon wrote:
> > Hello,
> > 
> > I've inherited a Ceph cluster from someone who has left zero documentation 
> > or any handover. A couple days ago it decided to show the entire company 
> > what it is capable of..
> > 
> > The health report looks like this:
> > 
> > [root@host mnt]# ceph -s
> >   cluster:
> > id: 809718aa-3eac-4664-b8fa-38c46cdbfdab
> > health: HEALTH_ERR
> > 1 MDSs report damaged metadata
> > 1 MDSs are read only
> > 2 MDSs report slow requests
> > 6 MDSs behind on trimming
> > Reduced data availability: 2 pgs stale
> > Degraded data redundancy: 2593/186803520 objects degraded 
> > (0.001%), 2 pgs degraded, 2 pgs undersized
> > 1 slow requests are blocked > 32 sec. Implicated osds
> > 716 stuck requests are blocked > 4096 sec. Implicated osds 
> > 25,31,38
> 
> I would start here:
> 
> > 
> >   services:
> > mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
> > mgr: a(active)
> > mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up
> > {[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active},
> > 4 up:standby-replay
> > osd: 39 osds: 39 up, 38 in
> > 
> >   data:
> > pools:   5 pools, 706 pgs
> > objects: 91212k objects, 4415 GB
> > usage:   10415 GB used, 13024 GB / 23439 GB avail
> > pgs: 2593/186803520 objects degraded (0.001%)
> >  703 active+clean
> >  2   stale+active+undersized+degraded
> 
> This is a problem! Can you check:
> 
> $ ceph pg dump_stuck
> 
> The PGs will start with a number like 8.1a where '8' is the pool ID.
> 
> Then check:
> 
> $ ceph df
> 
> To which pools do those PGs belong?
> 
> Then check:
> 
> $ ceph pg <pgid> query
> 
> And somewhere near the bottom it should show why these PGs are not active. You
> might even want to try a restart of the OSDs involved with those two PGs.

Thanks a lot for the suggestions - I just checked and it says that the 
problematic PGs are 4.4f and 4.59 - but querying those seems to result in the 
following error:

Error ENOENT: i don't have pgid 4.4f

(same applies for 4.59 - they do seem to show up in "ceph pg ls" though.)

In ceph pg ls, it shows that for these PGs UP, UP_PRIMARY, ACTING and 
ACTING_PRIMARY all have only one OSD associated with them (24 and 13 - although 
both the PG IDs mentioned above and these numbers probably don't help much with 
the diagnosis). Would restarting be a safe thing to try first?

ceph health detail says the following:

MDS_DAMAGE 1 MDSs report damaged metadata
mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Metadata damage detected
MDS_READ_ONLY 1 MDSs are read only
mdsceph-fs-5b997cbf7b-5tjwh(mds.0): MDS in read-only mode
MDS_SLOW_REQUEST 2 MDSs report slow requests
mdsuser-fs-5668c75f9f-hflps(mds.0): 3 slow requests are blocked > 30 sec
mdsuser-fs-5668c75f9f-jf59x(mds.1): 980 slow requests are blocked > 30 sec
MDS_TRIM 6 MDSs behind on trimming
mdsuser-fs-5668c75f9f-hflps(mds.0): Behind on trimming (342/128) 
max_segments: 128, num_segments: 342
mdsuser-fs-5668c75f9f-jf59x(mds.1): Behind on trimming (461/128) 
max_segments: 128, num_segments: 461
mdsuser-fs-5668c75f9f-h8p2t(mds.0): Behind on trimming (342/128) 
max_segments: 128, num_segments: 342
mdsuser-fs-5668c75f9f-7gs67(mds.1): Behind on trimming (461/128) 
max_segments: 128, num_segments: 461
mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Behind on trimming (386/128) 
max_segments: 128, num_segments: 386
mdsceph-fs-5b997cbf7b-hmrxr(mds.0): Behind on trimming (386/128) 
max_segments: 128, num_segments: 386
PG_AVAILABILITY Reduced data availability: 2 pgs stale
pg 4.4f is stuck stale for 171783.855465, current state 
stale+active+undersized+degraded, last acting [24]
pg 4.59 is stuck stale for 171751.961506, current state 
stale+active+undersized+degraded, last acting [13]
PG_DEGRADED Degraded data redundancy: 2593/186805106 objects degraded (0.001%), 
2 pgs degraded, 2 pgs undersized
pg 4.4f is stuck undersized for 171797.245359, current state 
stale+active+undersized+degraded, last acting [24]
pg 4.59 is stuck undersized for 171797.257707, current state 
stale+active+undersized+degraded, last acting [13]
REQUEST_SLOW 3 slow requests are blocked > 32 sec. Implicated osds
3 ops are blocked > 2097.15 sec
REQUEST_STUCK 717 stuck requests are blocked > 4096 sec. Implicated osds 
25,31,38
286 ops are blocked > 268435 sec
211 ops are blocked > 134218 sec
5 ops are blocked > 67108.9 sec
2 ops are blocked > 33554.4 sec
134 ops are blocked > 16777.2 sec
79 ops are blocked > 8388.61 sec
osds 25,31,38 have stuck requests > 268435 sec

Cheers,
Sangwhan

> 
> Wido
> 
> >  1   active+clean+scrubbing+deep
> > 
> >   io:
> > client:   168 

Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kaspar Bosma

+1 on that. We are going to add 384 OSDs next week to a 2K+ cluster. The proposed solution really works well!

Kaspar

On 24 July 2019 at 21:06, Paul Emmerich wrote:

+1 on adding them all at the same time. All these methods that gradually increase the weight aren't really necessary in newer releases of Ceph.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Jul 24, 2019 at 8:59 PM Reed Dier <reed.d...@focusvq.com> wrote:

Just chiming in to say that this too has been my preferred method for adding [large numbers of] OSDs.

Set the norebalance nobackfill flags.
Create all the OSDs, and verify everything looks good.
Make sure my max_backfills, recovery_max_active are as expected.
Make sure everything has peered.
Unset flags and let it run.

One crush map change, one data movement.

Reed

That works, but with newer releases I've been doing this:

- Make sure cluster is HEALTH_OK
- Set the 'norebalance' flag (and usually nobackfill)
- Add all the OSDs
- Wait for the PGs to peer. I usually wait a few minutes
- Remove the norebalance and nobackfill flag
- Wait for HEALTH_OK

Wido
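As a rough sketch, that flag-based procedure comes down to a handful of standard ceph commands; the OSD-creation step in the middle depends entirely on your deployment tooling and is only hinted at here:

$ ceph osd set norebalance
$ ceph osd set nobackfill
# ... create all the new OSDs with your usual tooling, then wait for them to peer ...
$ ceph osd unset nobackfill
$ ceph osd unset norebalance
$ ceph -s    # wait for HEALTH_OK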
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth

2019-07-24 Thread Wido den Hollander



On 7/25/19 6:49 AM, Sangwhan Moon wrote:
> Hello,
> 
> I've inherited a Ceph cluster from someone who has left zero documentation or 
> any handover. A couple days ago it decided to show the entire company what it 
> is capable of..
> 
> The health report looks like this:
> 
> [root@host mnt]# ceph -s
>   cluster:
> id: 809718aa-3eac-4664-b8fa-38c46cdbfdab
> health: HEALTH_ERR
> 1 MDSs report damaged metadata
> 1 MDSs are read only
> 2 MDSs report slow requests
> 6 MDSs behind on trimming
> Reduced data availability: 2 pgs stale
> Degraded data redundancy: 2593/186803520 objects degraded 
> (0.001%), 2 pgs degraded, 2 pgs undersized
> 1 slow requests are blocked > 32 sec. Implicated osds
> 716 stuck requests are blocked > 4096 sec. Implicated osds 
> 25,31,38

I would start here:

> 
>   services:
> mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
> mgr: a(active)
> mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up
> {[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active},
> 4 up:standby-replay
> osd: 39 osds: 39 up, 38 in
> 
>   data:
> pools:   5 pools, 706 pgs
> objects: 91212k objects, 4415 GB
> usage:   10415 GB used, 13024 GB / 23439 GB avail
> pgs: 2593/186803520 objects degraded (0.001%)
>  703 active+clean
>  2   stale+active+undersized+degraded

This is a problem! Can you check:

$ ceph pg dump_stuck

The PGs will start with a number like 8.1a where '8' is the pool ID.

Then check:

$ ceph df

To which pools do those PGs belong?

Then check:

$ ceph pg <pgid> query

And somewhere near the bottom it should show why these PGs are not active. You
might even want to try a restart of the OSDs involved with those two PGs.
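For example (the PG ID and OSD number below are purely illustrative, and on a Rook-managed cluster you would restart the corresponding OSD pod instead of using systemctl):

$ ceph pg dump_stuck stale
$ ceph pg 4.4f query | less
# then, on the host that carries osd.24:
$ systemctl restart ceph-osd@24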

Wido

>  1   active+clean+scrubbing+deep
> 
>   io:
> client:   168 kB/s rd, 6336 B/s wr, 10 op/s rd, 1 op/s wr
> 
> The offending broken MDS entry (damaged metadata) seems to be this:
> 
> mds.ceph-fs-5b997cbf7b-5tjwh: [
> {
> "damage_type": "dir_frag",
> "id": 1190692215,
> "ino": 2199023258131,
> "frag": "*",
> "path": "/f/01/59"
> }
> ]
> 
> Does anyone have any idea how I can diagnose and find out what is wrong? For the 
> other issues I'm not even sure what or where I need to look.
> 
> Cheers,
> Sangwhan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth

2019-07-24 Thread Sangwhan Moon
Hello,

I've inherited a Ceph cluster from someone who has left zero documentation or 
any handover. A couple days ago it decided to show the entire company what it 
is capable of..

The health report looks like this:

[root@host mnt]# ceph -s
  cluster:
id: 809718aa-3eac-4664-b8fa-38c46cdbfdab
health: HEALTH_ERR
1 MDSs report damaged metadata
1 MDSs are read only
2 MDSs report slow requests
6 MDSs behind on trimming
Reduced data availability: 2 pgs stale
Degraded data redundancy: 2593/186803520 objects degraded (0.001%), 
2 pgs degraded, 2 pgs undersized
1 slow requests are blocked > 32 sec. Implicated osds
716 stuck requests are blocked > 4096 sec. Implicated osds 25,31,38

  services:
mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
mgr: a(active)
mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up
{[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active},
4 up:standby-replay
osd: 39 osds: 39 up, 38 in

  data:
pools:   5 pools, 706 pgs
objects: 91212k objects, 4415 GB
usage:   10415 GB used, 13024 GB / 23439 GB avail
pgs: 2593/186803520 objects degraded (0.001%)
 703 active+clean
 2   stale+active+undersized+degraded
 1   active+clean+scrubbing+deep

  io:
client:   168 kB/s rd, 6336 B/s wr, 10 op/s rd, 1 op/s wr

The offending broken MDS entry (damaged metadata) seems to be this:

mds.ceph-fs-5b997cbf7b-5tjwh: [
{
"damage_type": "dir_frag",
"id": 1190692215,
"ino": 2199023258131,
"frag": "*",
"path": "/f/01/59"
}
]

Does anyone have any idea how I can diagnose and find out what is wrong? For the other 
issues I'm not even sure what or where I need to look.

Cheers,
Sangwhan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread zhanrzh...@teamsun.com.cn
I think you should set "osd_pool_default_min_size=1" before you add OSDs,
and the OSDs that you add at a time should be in the same failure domain.



Hi,
What would be the proper way to add 100 new OSDs to a cluster?
I have to add 100 new OSDs to our actual > 300 OSDs cluster, and I would like 
to know how you do it.
Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and it 
can handle plenty of load, but for the sake of safety -it hosts thousands of 
VMs via RBD- we usually add them one by one, waiting for a long time between 
adding each OSD.
Obviously this leads to PLENTY of data movement, as each time the cluster 
geometry changes, data is migrated among all the OSDs. But with the kind of 
load we have, if we add several OSDs at the same time, some PGs can get stuck 
for a while, while they peer to the new OSDs.
Now that I have to add > 100 new OSDs I was wondering if somebody has some 
suggestions.
Thanks!
Xavier.
 



zhanrzh...@teamsun.com.cn
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Future of Filestore?

2019-07-24 Thread Виталий Филиппов
Cache=writeback is perfectly safe, it's flushed when the guest calls fsync, so 
journaled filesystems and databases don't lose data that's committed to the 
journal.

On 25 July 2019 at 2:28:26 GMT+03:00, Stuart Longland wrote:
>On 25/7/19 9:01 am, vita...@yourcmc.ru wrote:
>>> 60 millibits per second?  60 bits every 1000 seconds?  Are you
>serious?
>>>  Or did we get the capitalisation wrong?
>>>
>>> Assuming 60MB/sec (as 60 Mb/sec would still be slower than the
>5MB/sec I
>>> was getting), maybe there's some characteristic that Bluestore is
>>> particularly dependent on regarding the HDDs.
>>>
>>> I'll admit right up front the drives I'm using were chosen because
>they
>>> were all I could get with a 2TB storage capacity for a reasonable
>price.
>>>
>>> I'm not against moving to Bluestore, however, I think I need to
>research
>>> it better to understand why the performance I was getting before was
>so
>>> poor.
>> 
>> It's a nano-ceph! So millibits :) I mean 60 megabytes per second, of 
>> course. My drives are also crap. I just want to say that you probably
>
>> miss some option for your VM, for example "cache=writeback".
>
>cache=writeback should have no effect on read performance but could be 
>quite dangerous if the VM host were to go down immediately after a
>write 
>for any reason.
>
>While 60MB/sec is getting respectable, doing so at the cost of data 
>safety is not something I'm keen on.
>-- 
>Stuart Longland (aka Redhatter, VK4MSL)
>
>I haven't lost my mind...
>   ...it's backed up on a tape somewhere.

-- 
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New best practices for osds???

2019-07-24 Thread Xavier Trilla
Hi,

We run a few hundred HDD OSDs for our backup cluster. We set one RAID 0 per HDD 
in order to be able to use the (battery-protected) write cache from the RAID 
controller. It really improves performance, for both bluestore and filestore 
OSDs.

We also avoid expanders as we had bad experiences with them.

Xavier 


-Original Message-
From: ceph-users  On behalf of Simon Ironside
Sent: Thursday, 25 July 2019 0:38
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] New best practices for osds???

RAID0 mode being discussed here means several RAID0 "arrays", each with a 
single physical disk as a member of it.
I.e. the number of OSDs is the same whether in RAID0 or JBOD mode.
E.g. 12x physical disks = 12x RAID0 single disk "arrays" or 12x JBOD physical 
disks = 12x OSDs.

Simon

On 24/07/2019 23:14, solarflow99 wrote:
> I can't understand how using RAID0 is better than JBOD, considering 
> jbod would be many individual disks, each used as OSDs, instead of a 
> single big one used as a single OSD.
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Future of Filestore?

2019-07-24 Thread Stuart Longland

On 25/7/19 9:01 am, vita...@yourcmc.ru wrote:

60 millibits per second?  60 bits every 1000 seconds?  Are you serious?
 Or did we get the capitalisation wrong?

Assuming 60MB/sec (as 60 Mb/sec would still be slower than the 5MB/sec I
was getting), maybe there's some characteristic that Bluestore is
particularly dependent on regarding the HDDs.

I'll admit right up front the drives I'm using were chosen because they
were all I could get with a 2TB storage capacity for a reasonable price.

I'm not against moving to Bluestore, however, I think I need to research
it better to understand why the performance I was getting before was so
poor.


It's a nano-ceph! So millibits :) I mean 60 megabytes per second, of 
course. My drives are also crap. I just want to say that you probably 
miss some option for your VM, for example "cache=writeback".


cache=writeback should have no effect on read performance but could be 
quite dangerous if the VM host were to go down immediately after a write 
for any reason.


While 60MB/sec is getting respectable, doing so at the cost of data 
safety is not something I'm keen on.

--
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Future of Filestore?

2019-07-24 Thread vitalif

60 millibits per second?  60 bits every 1000 seconds?  Are you serious?
 Or did we get the capitalisation wrong?

Assuming 60MB/sec (as 60 Mb/sec would still be slower than the 5MB/sec 
I

was getting), maybe there's some characteristic that Bluestore is
particularly dependent on regarding the HDDs.

I'll admit right up front the drives I'm using were chosen because they
were all I could get with a 2TB storage capacity for a reasonable 
price.


I'm not against moving to Bluestore, however, I think I need to 
research

it better to understand why the performance I was getting before was so
poor.


It's a nano-ceph! So millibits :) I mean 60 megabytes per second, of 
course. My drives are also crap. I just want to say that you probably 
miss some option for your VM, for example "cache=writeback".


The exact commandline I used to start my test VM was:

kvm -m 1024 -drive format=rbd,file=rbd:rpool/debian10-1,cache=writeback 
-vnc 0.0.0.0:0 -netdev user,id=mn -device virtio-net,netdev=mn

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Future of Filestore?

2019-07-24 Thread Stuart Longland
On 25/7/19 8:48 am, Vitaliy Filippov wrote:
> I get 60 mb/s inside a VM in my home nano-ceph consisting of 5 HDDs 4 of
> which are inside one PC and 5th is plugged into a ROCK64 :)) I use
> Bluestore...

60 millibits per second?  60 bits every 1000 seconds?  Are you serious?
 Or did we get the capitalisation wrong?

Assuming 60MB/sec (as 60 Mb/sec would still be slower than the 5MB/sec I
was getting), maybe there's some characteristic that Bluestore is
particularly dependent on regarding the HDDs.

I'll admit right up front the drives I'm using were chosen because they
were all I could get with a 2TB storage capacity for a reasonable price.

I'm not against moving to Bluestore, however, I think I need to research
it better to understand why the performance I was getting before was so
poor.
-- 
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Future of Filestore?

2019-07-24 Thread Vitaliy Filippov

/dev/vdb:
 Timing cached reads:   2556 MB in  1.99 seconds = 1281.50 MB/sec
 Timing buffered disk reads:  62 MB in  3.03 seconds =  20.48 MB/sec


That is without any special tuning, just migrating back to FileStore…
journal is on the HDD (it wouldn't let me put it on the SSD like it did
last time).

As I say, not going to set the world on fire, but 20MB/sec is quite
usable for my needs.  The 4× speed increase is very welcome!


I get 60 mb/s inside a VM in my home nano-ceph consisting of 5 HDDs 4 of  
which are inside one PC and 5th is plugged into a ROCK64 :)) I use  
Bluestore...


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New best practices for osds???

2019-07-24 Thread Simon Ironside
RAID0 mode being discussed here means several RAID0 "arrays", each with 
a single physical disk as a member of it.

I.e. the number of OSDs is the same whether in RAID0 or JBOD mode.
E.g. 12x physical disks = 12x RAID0 single disk "arrays" or 12x JBOD 
physical disks = 12x OSDs.


Simon

On 24/07/2019 23:14, solarflow99 wrote:
I can't understand how using RAID0 is better than JBOD, considering 
jbod would be many individual disks, each used as OSDs, instead of a 
single big one used as a single OSD.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New best practices for osds???

2019-07-24 Thread Vitaliy Filippov

One RAID0 array per drive :)


I can't understand how using RAID0 is better than JBOD, considering jbod
would be many individual disks, each used as OSDs, instead of a single  
big one used as a single OSD.


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New best practices for osds???

2019-07-24 Thread solarflow99
I can't understand how using RAID0 is better than JBOD, considering jbod
would be many individual disks, each used as OSDs, instead of a single big
one used as a single OSD.



On Mon, Jul 22, 2019 at 4:05 AM Vitaliy Filippov  wrote:

> OK, I meant "it may help performance" :) the main point is that we had at
> least one case of data loss due to some Adaptec controller in RAID0 mode
> discussed recently in our ceph chat...
>
> --
> With best regards,
>Vitaliy Filippov
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-users] Re: MDS failing under load with large cache sizes

2019-07-24 Thread Patrick Donnelly
+ other ceph-users

On Wed, Jul 24, 2019 at 10:26 AM Janek Bevendorff
 wrote:
>
> > what's the ceph.com mailing list? I wondered whether this list is dead but 
> > it's the list announced on the official ceph.com homepage, isn't it?
> There are two mailing lists announced on the website. If you go to
> https://ceph.com/resources/ you will find the
> subscribe/unsubscribe/archive links for the (much more active) ceph.com
> MLs. But if you click on "Mailing Lists & IRC page" you will get to a
> page where you can subscribe to this list, which is different. Very
> confusing.

It is confusing. This is supposed to be the new ML but I don't think
the migration has started yet.

> > What did you have the MDS cache size set to at the time?
> >
> > < and an inode count between
>
> I actually did not think I'd get a reply here. We are a bit further than
> this on the other mailing list. This is the thread:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-July/036095.html
>
> To sum it up: the ceph client prevents the MDS from freeing its cache,
> so inodes keep piling up until either the MDS becomes too slow (fixable
> by increasing the beacon grace time) or runs out of memory. The latter
> will happen eventually. In the end, my MDSs couldn't even rejoin because
> they hit the host's 128GB memory limit and crashed.

It's possible the MDS is not being aggressive enough with asking the
single (?) client to reduce its cache size. There were recent changes
[1] to the MDS to improve this. However, the defaults may not be
aggressive enough for your client's workload. Can you try:

ceph config set mds mds_recall_max_caps 1
ceph config set mds mds_recall_max_decay_rate 1.0

Also your other mailings made me think you may still be using the old
inode limit for the cache size. Are you using the new
mds_cache_memory_limit config option?
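If it helps, checking and raising that limit is a plain config get/set; the value below is only an example, not a recommendation:

$ ceph config get mds mds_cache_memory_limit
$ ceph config set mds mds_cache_memory_limit 17179869184   # example: 16 GiB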

Finally, if this fixes your issue (please let us know!) and you decide
to try multiple active MDS, you should definitely use pinning as the
parallel create workload will greatly benefit from it.
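A minimal pinning sketch, assuming a CephFS mount at /mnt/cephfs and two directory trees you want pinned to ranks 0 and 1 (the paths are hypothetical):

$ setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/project-a
$ setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/project-b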

[1] https://ceph.com/community/nautilus-cephfs/

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Future of Filestore?

2019-07-24 Thread Stuart Longland
On 23/7/19 9:59 pm, Stuart Longland wrote:
> I'll do some proper measurements once the migration is complete.

A starting point (I accept more rigorous disk storage tests exist):
> virtatomos ~ # hdparm -tT /dev/vdb
> 
> /dev/vdb:
>  Timing cached reads:   2556 MB in  1.99 seconds = 1281.50 MB/sec
>  Timing buffered disk reads:  62 MB in  3.03 seconds =  20.48 MB/sec

That is without any special tuning, just migrating back to FileStore…
journal is on the HDD (it wouldn't let me put it on the SSD like it did
last time).

As I say, not going to set the world on fire, but 20MB/sec is quite
usable for my needs.  The 4× speed increase is very welcome!
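For a somewhat more rigorous sequential-read test than hdparm, an fio run along these lines could be used (a sketch only; adjust the device, block size and runtime to your setup, and note --readonly to avoid touching data):

$ fio --name=seqread --filename=/dev/vdb --readonly --direct=1 --ioengine=libaio --rw=read --bs=4M --iodepth=16 --runtime=60 --time_based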
-- 
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Alfredo Deza
On Wed, Jul 24, 2019 at 4:15 PM Peter Eisch 
wrote:

> Hi,
>
>
>
> I appreciate the insistence that the directions be followed.  I wholly
> agree.  The only liberty I took was to do a ‘yum update’ instead of just
> ‘yum update ceph-osd’ and then reboot.  (Also my MDS runs on the MON hosts,
> so it got updated a step early.)
>
>
>
> As for the logs:
>
>
>
> [2019-07-24 15:07:22,713][ceph_volume.main][INFO  ] Running command:
> ceph-volume  simple scan
>
> [2019-07-24 15:07:22,714][ceph_volume.process][INFO  ] Running command:
> /bin/systemctl show --no-pager --property=Id --state=running ceph-osd@*
>
> [2019-07-24 15:07:27,574][ceph_volume.main][INFO  ] Running command:
> ceph-volume  simple activate --all
>
> [2019-07-24 15:07:27,575][ceph_volume.devices.simple.activate][INFO  ]
> activating OSD specified in
> /etc/ceph/osd/0-93fb5f2f-0273-4c87-a718-886d7e6db983.json
>
> [2019-07-24 15:07:27,576][ceph_volume.devices.simple.activate][ERROR ]
> Required devices (block and data) not present for bluestore
>
> [2019-07-24 15:07:27,576][ceph_volume.devices.simple.activate][ERROR ]
> bluestore devices found: [u'data']
>
> [2019-07-24 15:07:27,576][ceph_volume][ERROR ] exception caught by
> decorator
>
> Traceback (most recent call last):
>
>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line
> 59, in newfunc
>
> return f(*a, **kw)
>
>   File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148,
> in main
>
> terminal.dispatch(self.mapper, subcommand_args)
>
>   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line
> 182, in dispatch
>
> instance.main()
>
>   File
> "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/main.py", line
> 33, in main
>
> terminal.dispatch(self.mapper, self.argv)
>
>   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line
> 182, in dispatch
>
> instance.main()
>
>   File
> "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py",
> line 272, in main
>
> self.activate(args)
>
>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line
> 16, in is_root
>
> return func(*a, **kw)
>
>   File
> "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py",
> line 131, in activate
>
> self.validate_devices(osd_metadata)
>
>   File
> "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py",
> line 62, in validate_devices
>
> raise RuntimeError('Unable to activate bluestore OSD due to missing
> devices')
>
> RuntimeError: Unable to activate bluestore OSD due to missing devices
>
>
>
> (this is repeated for each of the 16 drives)
>
>
>
> Any other thoughts?  (I’ll delete/create the OSDs with ceph-deploy
> otherwise.)
>

Try using `ceph-volume simple scan --stdout` so that it doesn't persist
data onto /etc/ceph/osd/ and inspect that the JSON produced is capturing
all the necessary details for OSDs.

Alternatively, I would look into the JSON files already produced in
/etc/ceph/osd/ and check if the details are correct. The `scan` sub-command
does a tremendous effort to cover all cases where ceph-disk
created an OSD (filestore, bluestore, dmcrypt, etc...) but it is possible
that it may be hitting a problem. This is why the tool made these JSON
files available, so that they could be inspected and corrected if anything.

The details of the scan sub-command can be found at
http://docs.ceph.com/docs/master/ceph-volume/simple/scan/ and the JSON
structure is described in detail below at
http://docs.ceph.com/docs/master/ceph-volume/simple/scan/#json-contents

In this particular case the tool is refusing to activate what seems to be a
bluestore OSD. Is it really a bluestore OSD? if so, then it can't find
where is the data partition. What does that partition look like (for any of
the failing OSDs) ? Does it use dmcrypt, how was it created? (hopefully
with ceph-disk!)

If you know the data partition for a given OSD, try and pass it onto
'scan'. For example if it is /dev/sda1 you could do `ceph-volume simple
scan /dev/sda1` and check its output.
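As a concrete example of that inspection, reusing the OSD 0 JSON file mentioned earlier in this thread (the scan target should be whatever your data partition actually is):

# ceph-volume simple scan --stdout /dev/sda1
# cat /etc/ceph/osd/0-93fb5f2f-0273-4c87-a718-886d7e6db983.json | python -m json.tool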



>
> peter
>
>
>
>
> Peter Eisch
> Senior Site Reliability Engineer
> T *1.612.659.3228* <1.612.659.3228>
> *virginpulse.com* 
> | *virginpulse.com/global-challenge*
> 
>
> Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | 
> Switzerland | United Kingdom | USA
> Confidentiality Notice: The information contained in this e-mail,
> including any attachment(s), is intended solely for use by the designated
> recipient(s). Unauthorized use, dissemination, distribution, or
> reproduction of this message by anyone other than the intended
> recipient(s), or a person designated as responsible for delivering such
> messages to the 

Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch
Hi,

I appreciate the insistence that the directions be followed.  I wholly agree.  
The only liberty I took was to do a ‘yum update’ instead of just ‘yum update 
ceph-osd’ and then reboot.  (Also my MDS runs on the MON hosts, so it got 
updated a step early.)

As for the logs:

[2019-07-24 15:07:22,713][ceph_volume.main][INFO  ] Running command: 
ceph-volume  simple scan
[2019-07-24 15:07:22,714][ceph_volume.process][INFO  ] Running command: 
/bin/systemctl show --no-pager --property=Id --state=running ceph-osd@*
[2019-07-24 15:07:27,574][ceph_volume.main][INFO  ] Running command: 
ceph-volume  simple activate --all
[2019-07-24 15:07:27,575][ceph_volume.devices.simple.activate][INFO  ] 
activating OSD specified in 
/etc/ceph/osd/0-93fb5f2f-0273-4c87-a718-886d7e6db983.json
[2019-07-24 15:07:27,576][ceph_volume.devices.simple.activate][ERROR ] Required 
devices (block and data) not present for bluestore
[2019-07-24 15:07:27,576][ceph_volume.devices.simple.activate][ERROR ] 
bluestore devices found: [u'data']
[2019-07-24 15:07:27,576][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, 
in newfunc
return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148, in main
terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in 
dispatch
instance.main()
  File "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/main.py", 
line 33, in main
terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in 
dispatch
instance.main()
  File 
"/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py", line 
272, in main
self.activate(args)
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, 
in is_root
return func(*a, **kw)
  File 
"/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py", line 
131, in activate
self.validate_devices(osd_metadata)
  File 
"/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/activate.py", line 
62, in validate_devices
raise RuntimeError('Unable to activate bluestore OSD due to missing 
devices')
RuntimeError: Unable to activate bluestore OSD due to missing devices

(this is repeated for each of the 16 drives)

Any other thoughts?  (I’ll delete/create the OSDs with ceph-deploy otherwise.)

peter



Peter Eisch
Senior Site Reliability Engineer
T1.612.659.3228
virginpulse.com
|virginpulse.com/global-challenge
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland 
| United Kingdom | USA
Confidentiality Notice: The information contained in this e-mail, including any 
attachment(s), is intended solely for use by the designated recipient(s). 
Unauthorized use, dissemination, distribution, or reproduction of this message 
by anyone other than the intended recipient(s), or a person designated as 
responsible for delivering such messages to the intended recipient, is strictly 
prohibited and may be unlawful. This e-mail may contain proprietary, 
confidential or privileged information. Any views or opinions expressed are 
solely those of the author and do not necessarily represent those of Virgin 
Pulse, Inc. If you have received this message in error, or are not the named 
recipient(s), please immediately notify the sender and delete this e-mail 
message.
v2.59
From: Alfredo Deza 
Date: Wednesday, July 24, 2019 at 3:02 PM
To: Peter Eisch 
Cc: Paul Emmerich , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Upgrading and lost OSDs


On Wed, Jul 24, 2019 at 3:49 PM Peter Eisch 
<peter.ei...@virginpulse.com> wrote:

I’m at step 6.  I updated/rebooted the host to complete “installing the new 
packages and restarting the ceph-osd daemon” on the first OSD host.  All the 
systemctl definitions to start the OSDs were deleted, all the properties in 
/var/lib/ceph/osd/ceph-* directories were deleted.  All the files in 
/var/lib/ceph/osd-lockbox, for comparison, were untouched and still present.

Peeking into step 7 I can run ceph-volume:

# ceph-volume simple scan /dev/sda1
Running command: /usr/sbin/cryptsetup status /dev/sda1
Running command: /usr/sbin/cryptsetup status 
93fb5f2f-0273-4c87-a718-886d7e6db983
Running command: /bin/mount -v /dev/sda5 /tmp/tmpF5F8t2
stdout: mount: /dev/sda5 mounted on /tmp/tmpF5F8t2.
Running command: /usr/sbin/cryptsetup status /dev/sda5
Running command: /bin/ceph --cluster ceph --name 
client.osd-lockbox.93fb5f2f-0273-4c87-a718-886d7e6db983 --keyring 
/tmp/tmpF5F8t2/keyring config-key get 
dm-crypt/osd/93fb5f2f-0273-4c87-a718-886d7e6db983/luks
Running command: /bin/umount -v /tmp/tmpF5F8t2
stderr: umount: /tmp/tmpF5F8t2 (/dev/sda5) unmounted
Running command: /usr/sbin/cryptsetup --key-file - --allow-discards luksOpen 
/dev/sda1 93fb5f2f-0273-4c87-a718-886d7e6db983

Re: [ceph-users] Kernel, Distro & Ceph

2019-07-24 Thread Wido den Hollander



On 7/24/19 9:38 PM, dhils...@performair.com wrote:
> All;
> 
> There's been a lot of discussion of various kernel versions on this list 
> lately, so I thought I'd seek some clarification.
> 
> I prefer to run CentOS, and I prefer to keep the number of "extra" 
> repositories to a minimum.  Ceph requires adding a Ceph repo, and the EPEL 
> repo.  Updating the kernel requires (from the research I've done) adding 
> EL-Repo.  I believe CentOS 7 uses the 3.10 kernel.
> 

Are you planning on using CephFS? Because only on the clients using
CephFS through the kernel you might require a new kernel.

The nodes in the Ceph cluster can run with the stock CentOS kernel.
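If a newer kernel is needed on the CephFS kernel clients, one commonly used (unofficial) route on CentOS 7 is ELRepo's kernel-ml package; roughly, and only as a sketch (enabling the ELRepo repository itself is described at elrepo.org):

# yum --enablerepo=elrepo-kernel install kernel-ml
# reboot into the new kernel, then check:
# uname -r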

Wido

> Under what circumstances would you recommend adding EL-Repo to CentOS 7.6, 
> and installing kernel-ml?  Are there certain parts of Ceph which particularly 
> benefit from kernels newer than 3.10?
> 
> Thank you,
> 
> Dominic L. Hilsbos, MBA 
> Director - Information Technology 
> Perform Air International Inc.
> dhils...@performair.com 
> www.PerformAir.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Alfredo Deza
On Wed, Jul 24, 2019 at 3:49 PM Peter Eisch 
wrote:

>
>
> I’m at step 6.  I updated/rebooted the host to complete “installing the
> new packages and restarting the ceph-osd daemon” on the first OSD host.
> All the systemctl definitions to start the OSDs were deleted, all the
> properties in /var/lib/ceph/osd/ceph-* directories were deleted.  All the
> files in /var/lib/ceph/osd-lockbox, for comparison, were untouched and
> still present.
>
>
>
> Peeking into step 7 I can run ceph-volume:
>
>
>
> # ceph-volume simple scan /dev/sda1
>
> Running command: /usr/sbin/cryptsetup status /dev/sda1
>
> Running command: /usr/sbin/cryptsetup status
> 93fb5f2f-0273-4c87-a718-886d7e6db983
>
> Running command: /bin/mount -v /dev/sda5 /tmp/tmpF5F8t2
>
> stdout: mount: /dev/sda5 mounted on /tmp/tmpF5F8t2.
>
> Running command: /usr/sbin/cryptsetup status /dev/sda5
>
> Running command: /bin/ceph --cluster ceph --name
> client.osd-lockbox.93fb5f2f-0273-4c87-a718-886d7e6db983 --keyring
> /tmp/tmpF5F8t2/keyring config-key get
> dm-crypt/osd/93fb5f2f-0273-4c87-a718-886d7e6db983/luks
>
> Running command: /bin/umount -v /tmp/tmpF5F8t2
>
> stderr: umount: /tmp/tmpF5F8t2 (/dev/sda5) unmounted
>
> Running command: /usr/sbin/cryptsetup --key-file - --allow-discards
> luksOpen /dev/sda1 93fb5f2f-0273-4c87-a718-886d7e6db983
>
> Running command: /bin/mount -v
> /dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983 /tmp/tmpYK0WEV
>
> stdout: mount: /dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983 mounted on
> /tmp/tmpYK0WEV.
>
> --> broken symlink found /tmp/tmpYK0WEV/block ->
> /dev/mapper/a05b447c-c901-4690-a249-cc1a2d62a110
>
> Running command: /usr/sbin/cryptsetup status /tmp/tmpYK0WEV/block_dmcrypt
>
> Running command: /usr/sbin/cryptsetup status
> /dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983
>
> Running command: /bin/umount -v /tmp/tmpYK0WEV
>
> stderr: umount: /tmp/tmpYK0WEV
> (/dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983) unmounted
>
> Running command: /usr/sbin/cryptsetup remove
> /dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983
>
> --> OSD 0 got scanned and metadata persisted to file:
> /etc/ceph/osd/0-93fb5f2f-0273-4c87-a718-886d7e6db983.json
>
> --> To take over management of this scanned OSD, and disable ceph-disk and
> udev, run:
>
> --> ceph-volume simple activate 0 93fb5f2f-0273-4c87-a718-886d7e6db983
>
> #
>
> #
>
> # ceph-volume simple activate 0 93fb5f2f-0273-4c87-a718-886d7e6db983
>
> --> Required devices (block and data) not present for bluestore
>
> --> bluestore devices found: [u'data']
>
> -->  RuntimeError: Unable to activate bluestore OSD due to missing devices
>
> #
>

The tool detected bluestore, or rather, it failed to find a journal
associated with /dev/sda1. Scanning a single partition can cause that.
There is a flag to spit out the findings to STDOUT instead of persisting
them in /etc/ceph/osd/

Since this is a "whole system" upgrade, the upgrade documentation
instructions need to be followed:

ceph-volume simple scan
ceph-volume simple activate --all


If the `scan` command doesn't display any information (not even with the
--stdout flag) then the logs at /var/log/ceph/ceph-volume.log need to be
inspected. It would be useful to check any findings in there


>
> Okay, this created /etc/ceph/osd/*.json.  This is cool.  Is there a
> command or option which will read these files and mount the devices?
>
>
>
> peter
>
>
>
>
>
>
> Peter Eisch
> Senior Site Reliability Engineer
> T *1.612.659.3228* <1.612.659.3228>
> *virginpulse.com* 
> | *virginpulse.com/global-challenge*
> 
>
> Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | 
> Switzerland | United Kingdom | USA
> Confidentiality Notice: The information contained in this e-mail,
> including any attachment(s), is intended solely for use by the designated
> recipient(s). Unauthorized use, dissemination, distribution, or
> reproduction of this message by anyone other than the intended
> recipient(s), or a person designated as responsible for delivering such
> messages to the intended recipient, is strictly prohibited and may be
> unlawful. This e-mail may contain proprietary, confidential or privileged
> information. Any views or opinions expressed are solely those of the author
> and do not necessarily represent those of Virgin Pulse, Inc. If you have
> received this message in error, or are not the named recipient(s), please
> immediately notify the sender and delete this e-mail message.
> v2.59
>
> *From: *Alfredo Deza 
> *Date: *Wednesday, July 24, 2019 at 2:20 PM
> *To: *Peter Eisch 
> *Cc: *Paul Emmerich , "ceph-users@lists.ceph.com"
> 
> *Subject: *Re: [ceph-users] Upgrading and lost OSDs
>
>
>
> On Wed, Jul 24, 2019 at 2:56 PM Peter Eisch 
> wrote:
>
> Hi Paul,
>
> 

Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch

I’m at step 6.  I updated/rebooted the host to complete “installing the new 
packages and restarting the ceph-osd daemon” on the first OSD host.  All the 
systemctl definitions to start the OSDs were deleted, all the properties in 
/var/lib/ceph/osd/ceph-* directories were deleted.  All the files in 
/var/lib/ceph/osd-lockbox, for comparison, were untouched and still present.

Peeking into step 7 I can run ceph-volume:

# ceph-volume simple scan /dev/sda1
Running command: /usr/sbin/cryptsetup status /dev/sda1
Running command: /usr/sbin/cryptsetup status 
93fb5f2f-0273-4c87-a718-886d7e6db983
Running command: /bin/mount -v /dev/sda5 /tmp/tmpF5F8t2
stdout: mount: /dev/sda5 mounted on /tmp/tmpF5F8t2.
Running command: /usr/sbin/cryptsetup status /dev/sda5
Running command: /bin/ceph --cluster ceph --name 
client.osd-lockbox.93fb5f2f-0273-4c87-a718-886d7e6db983 --keyring 
/tmp/tmpF5F8t2/keyring config-key get 
dm-crypt/osd/93fb5f2f-0273-4c87-a718-886d7e6db983/luks
Running command: /bin/umount -v /tmp/tmpF5F8t2
stderr: umount: /tmp/tmpF5F8t2 (/dev/sda5) unmounted
Running command: /usr/sbin/cryptsetup --key-file - --allow-discards luksOpen 
/dev/sda1 93fb5f2f-0273-4c87-a718-886d7e6db983
Running command: /bin/mount -v /dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983 
/tmp/tmpYK0WEV
stdout: mount: /dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983 mounted on 
/tmp/tmpYK0WEV.
--> broken symlink found /tmp/tmpYK0WEV/block -> 
/dev/mapper/a05b447c-c901-4690-a249-cc1a2d62a110
Running command: /usr/sbin/cryptsetup status /tmp/tmpYK0WEV/block_dmcrypt
Running command: /usr/sbin/cryptsetup status 
/dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983
Running command: /bin/umount -v /tmp/tmpYK0WEV
stderr: umount: /tmp/tmpYK0WEV 
(/dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983) unmounted
Running command: /usr/sbin/cryptsetup remove 
/dev/mapper/93fb5f2f-0273-4c87-a718-886d7e6db983
--> OSD 0 got scanned and metadata persisted to file: 
/etc/ceph/osd/0-93fb5f2f-0273-4c87-a718-886d7e6db983.json
--> To take over management of this scanned OSD, and disable ceph-disk and 
udev, run:
--> ceph-volume simple activate 0 93fb5f2f-0273-4c87-a718-886d7e6db983
#
#
# ceph-volume simple activate 0 93fb5f2f-0273-4c87-a718-886d7e6db983
--> Required devices (block and data) not present for bluestore
--> bluestore devices found: [u'data']
-->  RuntimeError: Unable to activate bluestore OSD due to missing devices
#

Okay, this created /etc/ceph/osd/*.json.  This is cool.  Is there a command or 
option which will read these files and mount the devices?

peter




Peter Eisch
Senior Site Reliability Engineer
T1.612.659.3228
virginpulse.com
|virginpulse.com/global-challenge
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland 
| United Kingdom | USA
Confidentiality Notice: The information contained in this e-mail, including any 
attachment(s), is intended solely for use by the designated recipient(s). 
Unauthorized use, dissemination, distribution, or reproduction of this message 
by anyone other than the intended recipient(s), or a person designated as 
responsible for delivering such messages to the intended recipient, is strictly 
prohibited and may be unlawful. This e-mail may contain proprietary, 
confidential or privileged information. Any views or opinions expressed are 
solely those of the author and do not necessarily represent those of Virgin 
Pulse, Inc. If you have received this message in error, or are not the named 
recipient(s), please immediately notify the sender and delete this e-mail 
message.
v2.59
From: Alfredo Deza 
Date: Wednesday, July 24, 2019 at 2:20 PM
To: Peter Eisch 
Cc: Paul Emmerich , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Upgrading and lost OSDs

On Wed, Jul 24, 2019 at 2:56 PM Peter Eisch 
<peter.ei...@virginpulse.com> wrote:
Hi Paul,

To do better to answer you question, I'm following: 
http://docs.ceph.com/docs/nautilus/releases/nautilus/

At step 6, upgrade OSDs, I jumped on an OSD host and did a full 'yum update' 
for patching the host and rebooted to pick up the current centos kernel.

If you are at Step 6 then it is *crucial* to understand that the tooling used 
to create the OSDs is no longer available and Step 7 *is absolutely required*.

ceph-volume has to scan the system and give you the output of all OSDs found so 
that it can persist them in /etc/ceph/osd/*.json files and then can later be
"activated".


I didn't run any specific commands for just updating the ceph RPMs in 
this process.

It is not clear if you are at Step 6 and wondering why OSDs are not up, or you 
are past that and ceph-volume wasn't able to detect anything.

peter
Peter 

Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-24 Thread Wido den Hollander


On 7/24/19 9:35 PM, Mark Schouten wrote:
> I’d say the cure is worse than the issue you’re trying to fix, but that’s my 
> two cents.
> 

I'm not completely happy with it either. Yes, the price goes up and
latency increases as well.

Right now I'm just trying to find a clever solution to this. It's a 2k
OSD cluster and the likelihood of a host or OSD crashing is reasonable
while you are performing maintenance on a different host.

All kinds of things have crossed my mind where using size=4 is one of them.

Wido

> Mark Schouten
> 
>> Op 24 jul. 2019 om 21:22 heeft Wido den Hollander  het 
>> volgende geschreven:
>>
>> Hi,
>>
>> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
>>
>> The reason I'm asking is that a customer of mine asked me for a solution
>> to prevent a situation which occurred:
>>
>> A cluster running with size=3 and replication over different racks was
>> being upgraded from 13.2.5 to 13.2.6.
>>
>> During the upgrade, which involved patching the OS as well, they
>> rebooted one of the nodes. During that reboot suddenly a node in a
>> different rack rebooted. It was unclear why this happened, but the node
>> was gone.
>>
>> While the upgraded node was rebooting and the other node crashed about
>> 120 PGs were inactive due to min_size=2
>>
>> Waiting for the nodes to come back, recovery to finish it took about 15
>> minutes before all VMs running inside OpenStack were back again.
>>
>> As you are upgrading or performing any maintenance with size=3 you can't
>> tolerate a failure of a node as that will cause PGs to go inactive.
>>
>> This made me think about using size=4 and min_size=2 to prevent this
>> situation.
>>
>> This obviously has implications on write latency and cost, but it would
>> prevent such a situation.
>>
>> Is anybody here running a Ceph cluster with size=4 and min_size=2 for
>> this reason?
>>
>> Thank you,
>>
>> Wido
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Kernel, Distro & Ceph

2019-07-24 Thread DHilsbos
All;

There's been a lot of discussion of various kernel versions on this list 
lately, so I thought I'd seek some clarification.

I prefer to run CentOS, and I prefer to keep the number of "extra" repositories 
to a minimum.  Ceph requires adding a Ceph repo, and the EPEL repo.  Updating 
the kernel requires (from the research I've done) adding EL-Repo.  I believe 
CentOS 7 uses the 3.10 kernel.

Under what circumstances would you recommend adding EL-Repo to CentOS 7.6, and 
installing kernel-ml?  Are there certain parts of Ceph which particularly 
benefit from kernels newer than 3.10?

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-24 Thread Mark Schouten
I’d say the cure is worse than the issue you’re trying to fix, but that’s my 
two cents.

Mark Schouten

> Op 24 jul. 2019 om 21:22 heeft Wido den Hollander  het 
> volgende geschreven:
> 
> Hi,
> 
> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
> 
> The reason I'm asking is that a customer of mine asked me for a solution
> to prevent a situation which occurred:
> 
> A cluster running with size=3 and replication over different racks was
> being upgraded from 13.2.5 to 13.2.6.
> 
> During the upgrade, which involved patching the OS as well, they
> rebooted one of the nodes. During that reboot suddenly a node in a
> different rack rebooted. It was unclear why this happened, but the node
> was gone.
> 
> While the upgraded node was rebooting and the other node crashed about
> 120 PGs were inactive due to min_size=2
> 
> Waiting for the nodes to come back, recovery to finish it took about 15
> minutes before all VMs running inside OpenStack were back again.
> 
> As you are upgrading or performing any maintenance with size=3 you can't
> tolerate a failure of a node as that will cause PGs to go inactive.
> 
> This made me think about using size=4 and min_size=2 to prevent this
> situation.
> 
> This obviously has implications on write latency and cost, but it would
> prevent such a situation.
> 
> Is anybody here running a Ceph cluster with size=4 and min_size=2 for
> this reason?
> 
> Thank you,
> 
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-24 Thread Paul Emmerich
We got a few size=4 pools, but most of them are metadata pools paired with
m=3 or m=4 erasure coded pools for the actual data.
Goal is to provide the same availability and durability guarantees for the
metadata as the data.

But we got some older odd setup with replicated size=4 for that reason
(setup predates nautilus so no ec overrides originally).

I'd prefer erasure coding over a size=4 setup for most scenarios nowadays.
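As a rough illustration of that kind of layout (profile and pool names, PG counts and k/m values below are made up and need to match your own failure domains):

$ ceph osd erasure-code-profile set ec-k4-m3 k=4 m=3 crush-failure-domain=host
$ ceph osd pool create data-ec 256 256 erasure ec-k4-m3
$ ceph osd pool set data-ec allow_ec_overwrites true
$ ceph osd pool create data-meta 64 64 replicated
$ ceph osd pool set data-meta size 4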


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 9:22 PM Wido den Hollander  wrote:

> Hi,
>
> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
>
> The reason I'm asking is that a customer of mine asked me for a solution
> to prevent a situation which occurred:
>
> A cluster running with size=3 and replication over different racks was
> being upgraded from 13.2.5 to 13.2.6.
>
> During the upgrade, which involved patching the OS as well, they
> rebooted one of the nodes. During that reboot suddenly a node in a
> different rack rebooted. It was unclear why this happened, but the node
> was gone.
>
> While the upgraded node was rebooting and the other node crashed about
> 120 PGs were inactive due to min_size=2
>
> Waiting for the nodes to come back, recovery to finish it took about 15
> minutes before all VMs running inside OpenStack were back again.
>
> As you are upgrading or performing any maintenance with size=3 you can't
> tolerate a failure of a node as that will cause PGs to go inactive.
>
> This made me think about using size=4 and min_size=2 to prevent this
> situation.
>
> This obviously has implications on write latency and cost, but it would
> prevent such a situation.
>
> Is anybody here running a Ceph cluster with size=4 and min_size=2 for
> this reason?
>
> Thank you,
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Anybody using 4x (size=4) replication?

2019-07-24 Thread Wido den Hollander
Hi,

Is anybody using 4x (size=4, min_size=2) replication with Ceph?

The reason I'm asking is that a customer of mine asked me for a solution
to prevent a situation which occurred:

A cluster running with size=3 and replication over different racks was
being upgraded from 13.2.5 to 13.2.6.

During the upgrade, which involved patching the OS as well, they
rebooted one of the nodes. During that reboot suddenly a node in a
different rack rebooted. It was unclear why this happened, but the node
was gone.

While the upgraded node was rebooting and the other node crashed about
120 PGs were inactive due to min_size=2

Waiting for the nodes to come back, recovery to finish it took about 15
minutes before all VMs running inside OpenStack were back again.

As you are upgrading or performing any maintenance with size=3 you can't
tolerate a failure of a node as that will cause PGs to go inactive.

This made me think about using size=4 and min_size=2 to prevent this
situation.
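For an existing replicated pool that would simply be the following (the pool name is a placeholder; changing size on a large pool of course triggers data movement):

$ ceph osd pool set <pool> size 4
$ ceph osd pool set <pool> min_size 2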

This obviously has implications on write latency and cost, but it would
prevent such a situation.

Is anybody here running a Ceph cluster with size=4 and min_size=2 for
this reason?

Thank you,

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Alfredo Deza
On Wed, Jul 24, 2019 at 2:56 PM Peter Eisch 
wrote:

> Hi Paul,
>
> To do better to answer you question, I'm following:
> http://docs.ceph.com/docs/nautilus/releases/nautilus/
>
> At step 6, upgrade OSDs, I jumped on an OSD host and did a full 'yum
> update' for patching the host and rebooted to pick up the current centos
> kernel.
>

If you are at Step 6 then it is *crucial* to understand that the tooling
used to create the OSDs is no longer available and Step 7 *is absolutely
required*.

ceph-volume has to scan the system and give you the output of all OSDs
found so that it can persist them in /etc/ceph/osd/*.json files and then
can later be
"activated".


> I didn't run any specific commands for just updating the ceph RPMs
> in this process.
>
>
It is not clear if you are at Step 6 and wondering why OSDs are not up, or
you are past that and ceph-volume wasn't able to detect anything.


> peter
>
> Peter Eisch
> Senior Site Reliability Engineer
> T *1.612.659.3228* <1.612.659.3228>
> *virginpulse.com* 
> | *virginpulse.com/global-challenge*
> 
>
> Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | 
> Switzerland | United Kingdom | USA
> Confidentiality Notice: The information contained in this e-mail,
> including any attachment(s), is intended solely for use by the designated
> recipient(s). Unauthorized use, dissemination, distribution, or
> reproduction of this message by anyone other than the intended
> recipient(s), or a person designated as responsible for delivering such
> messages to the intended recipient, is strictly prohibited and may be
> unlawful. This e-mail may contain proprietary, confidential or privileged
> information. Any views or opinions expressed are solely those of the author
> and do not necessarily represent those of Virgin Pulse, Inc. If you have
> received this message in error, or are not the named recipient(s), please
> immediately notify the sender and delete this e-mail message.
> v2.59
>
> From: Paul Emmerich 
> Date: Wednesday, July 24, 2019 at 1:39 PM
> To: Peter Eisch 
> Cc: Xavier Trilla , "ceph-users@lists.ceph.com"
> 
> Subject: Re: [ceph-users] Upgrading and lost OSDs
>
> On Wed, Jul 24, 2019 at 8:36 PM Peter Eisch <peter.ei...@virginpulse.com> wrote:
> # lsblk
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda 8:0 0 1.7T 0 disk
> ├─sda1 8:1 0 100M 0 part
> ├─sda2 8:2 0 1.7T 0 part
> └─sda5 8:5 0 10M 0 part
> sdb 8:16 0 1.7T 0 disk
> ├─sdb1 8:17 0 100M 0 part
> ├─sdb2 8:18 0 1.7T 0 part
> └─sdb5 8:21 0 10M 0 part
> sdc 8:32 0 1.7T 0 disk
> ├─sdc1 8:33 0 100M 0 part
>
> That's ceph-disk which was removed, run "ceph-volume simple scan"
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at
> https://croit.io
> 
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
>
> http://www.croit.io
> 
> Tel: +49 89 1896585 90
>
>
> ...
> I'm thinking the OSD would start (I can recreate the .service definitions
> in systemctl) if the above were mounted in a way like they are on another
> of my hosts:
> # lsblk
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda 8:0 0 1.7T 0 disk
> ├─sda1 8:1 0 100M 0 part
> │ └─97712be4-1234-4acc-8102-2265769053a5 253:17 0 98M 0 crypt
> /var/lib/ceph/osd/ceph-16
> ├─sda2 8:2 0 1.7T 0 part
> │ └─049b7160-1234-4edd-a5dc-fe00faca8d89 253:16 0 1.7T 0 crypt
> └─sda5 8:5 0 10M 0 part
> /var/lib/ceph/osd-lockbox/97712be4-9674-4acc-1234-2265769053a5
> sdb 8:16 0 1.7T 0 disk
> ├─sdb1 8:17 0 100M 0 part
> │ └─f03f0298-1234-42e9-8b28-f3016e44d1e2 253:26 0 98M 0 crypt
> /var/lib/ceph/osd/ceph-17
> ├─sdb2 8:18 0 1.7T 0 part
> │ └─51177019-1234-4963-82d1-5006233f5ab2 253:30 0 1.7T 0 crypt
> └─sdb5 8:21 0 10M 0 part
> /var/lib/ceph/osd-lockbox/f03f0298-1234-42e9-8b28-f3016e44d1e2
> sdc 8:32 0 1.7T 0 disk
> ├─sdc1 8:33 0 100M 

Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Paul Emmerich
+1 on adding them all at the same time.

All these methods that gradually increase the weight aren't really
necessary in newer releases of Ceph.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 8:59 PM Reed Dier  wrote:

> Just chiming in to say that this too has been my preferred method for
> adding [large numbers of] OSDs.
>
> Set the norebalance nobackfill flags.
> Create all the OSDs, and verify everything looks good.
> Make sure my max_backfills, recovery_max_active are as expected.
> Make sure everything has peered.
> Unset flags and let it run.
>
> One crush map change, one data movement.
>
> Reed
>
>
> That works, but with newer releases I've been doing this:
>
> - Make sure cluster is HEALTH_OK
> - Set the 'norebalance' flag (and usually nobackfill)
> - Add all the OSDs
> - Wait for the PGs to peer. I usually wait a few minutes
> - Remove the norebalance and nobackfill flag
> - Wait for HEALTH_OK
>
> Wido
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph durability during outages

2019-07-24 Thread Nathan Fish
It is inherently dangerous to accept client IO - particularly writes -
when at k. Just like it's dangerous to accept IO with 1 replica in
replicated mode. It is not inherently dangerous to do recovery when at
k, but apparently it was originally written to use min_size rather
than k.
Looking at the PR, the actual code change is fairly small, ~30 lines,
but it's a fairly critical change and has several pages of testing
code associated with it. It also requires setting
"osd_allow_recovery_below_min_size" just in case. So clearly it is
being treated with caution.
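
Assuming the option ships under that exact name (not verified here against a
released version), it would presumably be toggled like any other OSD setting:

  ceph config set osd osd_allow_recovery_below_min_size true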


On Wed, Jul 24, 2019 at 2:28 PM Jean-Philippe Méthot
 wrote:
>
> Thank you, that does make sense. I was completely unaware that min size was 
> k+1 and not k. Had I known that, I would have designed this pool differently.
>
> Regarding that feature for Octopus, I’m guessing it shouldn't be dangerous 
> for data integrity to recover at less than min_size?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
> Le 24 juill. 2019 à 13:49, Nathan Fish  a écrit :
>
> 2/3 monitors is enough to maintain quorum, as is any majority.
>
> However, EC pools have a default min_size of k+1 chunks.
> This can be adjusted to k, but that has its own dangers.
> I assume you are using failure domain = "host"?
> As you had k=6,m=2, and lost 2 failure domains, you had k chunks left,
> resulting in all IO stopping.
>
> Currently, EC pools that have k chunks but less than min_size do not rebuild.
> This is being worked on for Octopus: https://github.com/ceph/ceph/pull/17619
>
> k=6,m=2 is therefore somewhat slim for a 10-host cluster.
> I do not currently use EC, as I have only 3 failure domains, so others
> here may know better than me,
> but I might have done k=6,m=3. This would allow rebuilding to OK from
> 1 host failure, and remaining available in WARN state with 2 hosts
> down.
> k=4,m=4 would be very safe, but potentially too expensive.
>
>
> On Wed, Jul 24, 2019 at 1:31 PM Jean-Philippe Méthot
>  wrote:
>
>
> Hi,
>
> I’m running in production a 3-monitor, 10-OSD-node Ceph cluster. This 
> cluster is used to host OpenStack VM RBDs. My pools are set to use a k=6 m=2 
> erasure code profile with a 3 copy metadata pool in front. The cluster runs 
> well, but we recently had a short outage which triggered unexpected behaviour 
> in the cluster.
>
> I’ve always been under the impression that Ceph would continue working 
> properly even if nodes would go down. I tested it several months ago with 
> this configuration and it worked fine as long as only 2 nodes went down. 
> However, this time, the first monitor as well as two osd nodes went down. As 
> a result, Openstack VMs were able to mount their rbd volume but unable to 
> read from it, even after the cluster had recovered with the following message 
> : Reduced data availability: 599 pgs inactive, 599 pgs incomplete .
>
> I believe the cluster should have continued to work properly despite the 
> outage, so what could have prevented it from functioning? Is it because there 
> was only two monitors remaining? Or is it that reduced data availability 
> message? In that case, is my erasure coding configuration fine for that 
> number of nodes?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Reed Dier
Just chiming in to say that this too has been my preferred method for adding 
[large numbers of] OSDs.

Set the norebalance nobackfill flags.
Create all the OSDs, and verify everything looks good.
Make sure my max_backfills, recovery_max_active are as expected.
Make sure everything has peered.
Unset flags and let it run.

One crush map change, one data movement.
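
A rough sketch of the flag handling described above (the backfill values are
examples, not recommendations):

  ceph osd set norebalance
  ceph osd set nobackfill
  # ... create the new OSDs, wait for peering, sanity-check the tree ...
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
  ceph osd unset nobackfill
  ceph osd unset norebalance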

Reed

> 
> That works, but with newer releases I've been doing this:
> 
> - Make sure cluster is HEALTH_OK
> - Set the 'norebalance' flag (and usually nobackfill)
> - Add all the OSDs
> - Wait for the PGs to peer. I usually wait a few minutes
> - Remove the norebalance and nobackfill flag
> - Wait for HEALTH_OK
> 
> Wido
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch
Hi Paul,

To better answer your question, I'm following: 
http://docs.ceph.com/docs/nautilus/releases/nautilus/

At step 6, upgrade OSDs, I jumped on an OSD host and did a full 'yum update' 
for patching the host and rebooted to pick up the current centos kernel.

I didn't run any specific commands just for updating the ceph RPMs in 
this process.

peter



Peter Eisch
Senior Site Reliability Engineer
From: Paul Emmerich 
Date: Wednesday, July 24, 2019 at 1:39 PM
To: Peter Eisch 
Cc: Xavier Trilla , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Upgrading and lost OSDs

On Wed, Jul 24, 2019 at 8:36 PM Peter Eisch 
 wrote:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.7T 0 disk
├─sda1 8:1 0 100M 0 part
├─sda2 8:2 0 1.7T 0 part
└─sda5 8:5 0 10M 0 part
sdb 8:16 0 1.7T 0 disk
├─sdb1 8:17 0 100M 0 part
├─sdb2 8:18 0 1.7T 0 part
└─sdb5 8:21 0 10M 0 part
sdc 8:32 0 1.7T 0 disk
├─sdc1 8:33 0 100M 0 part

That's ceph-disk which was removed, run "ceph-volume simple scan"


--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at 
https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
http://www.croit.io
Tel: +49 89 1896585 90

 
...
I'm thinking the OSD would start (I can recreate the .service definitions in 
systemctl) if the above were mounted in a way like they are on another of my 
hosts:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.7T 0 disk
├─sda1 8:1 0 100M 0 part
│ └─97712be4-1234-4acc-8102-2265769053a5 253:17 0 98M 0 crypt 
/var/lib/ceph/osd/ceph-16
├─sda2 8:2 0 1.7T 0 part
│ └─049b7160-1234-4edd-a5dc-fe00faca8d89 253:16 0 1.7T 0 crypt
└─sda5 8:5 0 10M 0 part 
/var/lib/ceph/osd-lockbox/97712be4-9674-4acc-1234-2265769053a5
sdb 8:16 0 1.7T 0 disk
├─sdb1 8:17 0 100M 0 part
│ └─f03f0298-1234-42e9-8b28-f3016e44d1e2 253:26 0 98M 0 crypt 
/var/lib/ceph/osd/ceph-17
├─sdb2 8:18 0 1.7T 0 part
│ └─51177019-1234-4963-82d1-5006233f5ab2 253:30 0 1.7T 0 crypt
└─sdb5 8:21 0 10M 0 part 
/var/lib/ceph/osd-lockbox/f03f0298-1234-42e9-8b28-f3016e44d1e2
sdc 8:32 0 1.7T 0 disk
├─sdc1 8:33 0 100M 0 part
│ └─0184df0c-1234-404d-92de-cb71b1047abf 253:8 0 98M 0 crypt 
/var/lib/ceph/osd/ceph-18
├─sdc2 8:34 0 1.7T 0 part
│ └─fdad7618-1234-4021-a63e-40d973712e7b 253:13 0 1.7T 0 crypt
...

Thank you for your time on this,

peter

From: Xavier Trilla 
Date: Wednesday, July 24, 2019 at 1:25 PM
To: Peter Eisch 
Cc: "mailto:ceph-users@lists.ceph.com; 
Subject: Re: [ceph-users] Upgrading and lost OSDs

Hi Peter,

I'm not sure, but maybe after some changes the OSDs are not being recognized by 
the ceph scripts.

Ceph used to use udev to detect the OSDs and then moved to lvm. Which kind of 
OSDs are you running? Bluestore or filestore? Which version did you use to 
create them?

Cheers!

On 24 Jul 2019, at 20:04, Peter Eisch wrote:
Hi,

I’m working through updating from 12.2.12/luminous to 14.2.2/nautilus on 
CentOS 7.6. The managers are updated alright:

# ceph -s
  cluster:
    id:     2fdb5976-1234-4b29-ad9c-1ca74a9466ec
    health: HEALTH_WARN
            Degraded data redundancy: 24177/9555955 objects degraded (0.253%), 
7 pgs degraded, 1285 pgs undersized
            3 monitors have not enabled msgr2
 ...

I updated ceph on an OSD host with 'yum update' and then rebooted to grab the 
current kernel. Along the way, the contents of all the directories in 
/var/lib/ceph/osd/ceph-*/ were deleted. Thus I have 16 OSDs down from this.

Re: [ceph-users] Questions regarding backing up Ceph

2019-07-24 Thread Wido den Hollander



On 7/24/19 6:02 PM, Sinan Polat wrote:
> Hi,
> 
> Why not using backup tools that can do native OpenStack backups?
> 
> We are also using Ceph as the cinder backend on our OpenStack platform. We 
> use CommVault to make our backups.

How much data is there in that Ceph cluster? And how does it perform?

I assume it exports the RBD images through Cinder and stores them
somewhere? But this is probably a complete copy of the RBD image and I
just wonder how that scales.

A lot of systems are >500TB and it's very difficult to back that up
every week or so.

Wido

> 
> - Sinan
> 
>> Op 24 jul. 2019 om 17:48 heeft Wido den Hollander  het 
>> volgende geschreven:
>>
>>
>>
>>> On 7/24/19 4:06 PM, Fabian Niepelt wrote:
>>> Hi, thanks for the reply.
>>>
>>> Am Mittwoch, den 24.07.2019, 15:26 +0200 schrieb Wido den Hollander:

 On 7/24/19 1:37 PM, Fabian Niepelt wrote:
> Hello ceph-users,
>
> I am currently building a Ceph cluster that will serve as a backend for
> Openstack and object storage using RGW. The cluster itself is finished and
> integrated with Openstack and virtual machines for testing are being
> deployed.
> Now I'm a bit stumped on how to effectively backup the Ceph pools.
> My requirements are two weekly backups, of which one must be offline after
> finishing backing up (systems turned powerless). We are expecting about
> 250TB to
> 500TB of data for now. The backups must protect against accidental pool
> deletion/corruption or widespread infection of a cryptovirus. In short:
> Complete
> data loss in the production Ceph cluster.
>
> At the moment, I am facing two issues:
>
> 1. For the cinder pool, I looked into creating snapshots using the ceph 
> CLI
> (so
> they don't turn up in Openstack and cannot be accidentally deleted by 
> users)
> and
> exporting their diffs. But volumes with snapshots created this way cannot 
> be
> removed from Openstack. Does anyone have an idea how to do this better?

 You mean that while you leave the snapshot there OpenStack can't remove it?
>>>
>>> Yes, that is correct. cinder-volume cannot remove a volume that still has a
>>> snapshot. If the snapshot is created by openstack, it will remove the 
>>> snapshot
>>> before removing the volume. But snapshotting directly from ceph will forego
>>> Openstack so it will never know about that snapshot's existence.
>>>
>>
>> Ah, yes. That means you would need to remove it manually.
>>
> Alternatively, I could do a full export each week, but I am not sure if 
> that
> would be fast enough..
>

 It probably won't, but the full backup is still the safest way imho.
 However: Does this scale?

 You can export multiple RBD images in parallel and store them somewhere
 else, but it will still take a long time.

 The export needs to be stored somewhere and then picked up. Or you could
 use some magic with Netcat to stream the RBD export to a destination host.
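
A rough illustration of that netcat idea; the image, host and port are made up,
and nc flags vary between distributions:

  rbd export rbd/volume-1234 - | nc backup-host 9000    # on a cluster client
  nc -l 9000 > volume-1234.img                          # listener on backup-host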

>>>
>>> Scaling is also my biggest worry about this.
>>>
> 2. My search so far has only turned up backing up RBD pools, but how 
> could I
> backup the pools that are used for object storage?
>

 Not easily. I think you mean RGW? You could try the RGW MultiSite, but
 it's difficult.

 A complete DR with Ceph to restore it back to how it was at a given
 point in time is a challenge.

>>>
>>> Yes, I would like to backup the pools used by the RGW.
>>
>> Not really an option. You would need to use the RGW MultiSite to
>> replicate all data to a second environment.
>>
>>>
> Of course, I'm also open to completely other ideas on how to backup Ceph 
> and
> would appreciate hearing how you people are doing your backups.

 A lot of time the backups are created inside the VMs on File level. And
 there is a second OpenStack+Ceph system which runs a mirror of the VMs
 or application. If one burns down it's not the end of the world.

 Trying to backup a Ceph cluster sounds very 'enterprise' and is
 difficult to scale as well.

>>>
>>> Are those backups saved in Ceph as well? I cannot solely rely on Ceph 
>>> because we
>>> want to protect ourselves against failures in Ceph or a human accidentally 
>>> or
>>> maliciously deletes all pools.
>>>
>>
>> That second OpenStack+Ceph environment is completely different. All the
>> VMs are set up twice and using replication and backups on application
>> level such things are redundant.
>>
>> Think about MySQL replication for example.
>>
>>> From what I'm reading, it seems to be better to maybe implement a backup
>>> solution outside of Ceph that our Openstack users can use and not deal with
>>> backing up Ceph at all, except its configs to get it running after total
>>> disaster...
>>>
>>
>> You could backup OpenStack's MySQL database, the ceph config and then
>> backup the data inside the 

Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch
[2019-07-24 13:40:49,602][ceph_volume.process][INFO  ] Running command: 
/bin/systemctl show --no-pager --property=Id --state=running ceph-osd@*

This is the only log event.  At the prompt:

# ceph-volume simple scan
#

peter


Peter Eisch
Senior Site Reliability Engineer
From: Paul Emmerich 
Date: Wednesday, July 24, 2019 at 1:32 PM
To: Xavier Trilla 
Cc: Peter Eisch , "ceph-users@lists.ceph.com" 

Subject: Re: [ceph-users] Upgrading and lost OSDs

Did you use ceph-disk before?

Support for ceph-disk was removed, see Nautilus upgrade instructions. You'll 
need to run "ceph-volume simple scan" to convert them to ceph-volume

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at 
https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
http://www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 8:25 PM Xavier Trilla 
 wrote:
Hi Peter,

I'm not sure, but maybe after some changes the OSDs are not being recognized by 
the ceph scripts.

Ceph used to use udev to detect the OSDs and then moved to lvm. Which kind of 
OSDs are you running? Bluestore or filestore? Which version did you use to 
create them?

Cheers!

On 24 Jul 2019, at 20:04, Peter Eisch wrote:
Hi,

I’m working through updating from 12.2.12/luminous to 14.2.2/nautilus on 
CentOS 7.6. The managers are updated alright:

# ceph -s
  cluster:
    id:     2fdb5976-1234-4b29-ad9c-1ca74a9466ec
    health: HEALTH_WARN
            Degraded data redundancy: 24177/9555955 objects degraded (0.253%), 
7 pgs degraded, 1285 pgs undersized
            3 monitors have not enabled msgr2
 ...

I updated ceph on an OSD host with 'yum update' and then rebooted to grab the 
current kernel. Along the way, the contents of all the directories in 
/var/lib/ceph/osd/ceph-*/ were deleted. Thus I have 16 OSDs down from this. I 
can manage the undersized but I'd like to get these drives working again 
without deleting each OSD and recreating them.

So far I've pulled the respective cephx key into the 'keyring' file and 
populated 'bluestore' into the 'type' files but I'm unsure how to get the 
lockboxes mounted to where I can get the OSDs running. The osd-lockbox 
directory is otherwise untouched from when the OSDs were deployed.

Is there a way to run ceph-deploy or some other tool to rebuild the mounts for 
the drives?

peter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Wido den Hollander



On 7/24/19 7:15 PM, Kevin Hrpcek wrote:
> I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what
> I do, you can obviously change the weight increase steps to what you are
> comfortable with. This has worked well for me and my workloads. I've
> sometimes seen peering take longer if I do steps too quickly but I don't
> run any mission critical has to be up 100% stuff and I usually don't
> notice if a pg takes a while to peer.
> 
> Add all OSDs with an initial weight of 0. (nothing gets remapped)
> Ensure cluster is healthy.
> Use a for loop to increase weight on all new OSDs to 0.5 with a
> generous sleep between each for peering.
> Let the cluster balance and get healthy or close to healthy.
> Then repeat the previous 2 steps increasing weight by +0.5 or +1.0 until
> I am at the desired weight.

That works, but with newer releases I've been doing this:

- Make sure cluster is HEALTH_OK
- Set the 'norebalance' flag (and usually nobackfill)
- Add all the OSDs
- Wait for the PGs to peer. I usually wait a few minutes
- Remove the norebalance and nobackfill flag
- Wait for HEALTH_OK

Wido

> 
> Kevin
> 
> On 7/24/19 11:44 AM, Xavier Trilla wrote:
>>
>> Hi,
>>
>>  
>>
>> What would be the proper way to add 100 new OSDs to a cluster?
>>
>>  
>>
>> I have to add 100 new OSDs to our actual > 300 OSDs cluster, and I
>> would like to know how you do it.
>>
>>  
>>
>> Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one,
>> and it can handle plenty of load, but for the sake of safety -it hosts
>> thousands of VMs via RBD- we usually add them one by one, waiting for
>> a long time between adding each OSD.
>>
>>  
>>
>> Obviously this leads to PLENTY of data movement, as each time the
>> cluster geometry changes, data is migrated among all the OSDs. But
>> with the kind of load we have, if we add several OSDs at the same
>> time, some PGs can get stuck for a while, while they peer to the new OSDs.
>>
>>  
>>
>> Now that I have to add > 100 new OSDs I was wondering if somebody has
>> some suggestions.
>>
>>  
>>
>> Thanks!
>>
>> Xavier.
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Paul Emmerich
On Wed, Jul 24, 2019 at 8:36 PM Peter Eisch 
wrote:

> # lsblk
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda 8:0 0 1.7T 0 disk
> ├─sda1 8:1 0 100M 0 part
> ├─sda2 8:2 0 1.7T 0 part
> └─sda5 8:5 0 10M 0 part
> sdb 8:16 0 1.7T 0 disk
> ├─sdb1 8:17 0 100M 0 part
> ├─sdb2 8:18 0 1.7T 0 part
> └─sdb5 8:21 0 10M 0 part
> sdc 8:32 0 1.7T 0 disk
> ├─sdc1 8:33 0 100M 0 part
>

That's ceph-disk which was removed, run "ceph-volume simple scan"

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



> ...
> I'm thinking the OSD would start (I can recreate the .service definitions
> in systemctl) if the above were mounted in a way like they are on another
> of my hosts:
> # lsblk
> NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda 8:0 0 1.7T 0 disk
> ├─sda1 8:1 0 100M 0 part
> │ └─97712be4-1234-4acc-8102-2265769053a5 253:17 0 98M 0 crypt
> /var/lib/ceph/osd/ceph-16
> ├─sda2 8:2 0 1.7T 0 part
> │ └─049b7160-1234-4edd-a5dc-fe00faca8d89 253:16 0 1.7T 0 crypt
> └─sda5 8:5 0 10M 0 part
> /var/lib/ceph/osd-lockbox/97712be4-9674-4acc-1234-2265769053a5
> sdb 8:16 0 1.7T 0 disk
> ├─sdb1 8:17 0 100M 0 part
> │ └─f03f0298-1234-42e9-8b28-f3016e44d1e2 253:26 0 98M 0 crypt
> /var/lib/ceph/osd/ceph-17
> ├─sdb2 8:18 0 1.7T 0 part
> │ └─51177019-1234-4963-82d1-5006233f5ab2 253:30 0 1.7T 0 crypt
> └─sdb5 8:21 0 10M 0 part
> /var/lib/ceph/osd-lockbox/f03f0298-1234-42e9-8b28-f3016e44d1e2
> sdc 8:32 0 1.7T 0 disk
> ├─sdc1 8:33 0 100M 0 part
> │ └─0184df0c-1234-404d-92de-cb71b1047abf 253:8 0 98M 0 crypt
> /var/lib/ceph/osd/ceph-18
> ├─sdc2 8:34 0 1.7T 0 part
> │ └─fdad7618-1234-4021-a63e-40d973712e7b 253:13 0 1.7T 0 crypt
> ...
>
> Thank you for your time on this,
>
> peter
>
> Peter Eisch
> Senior Site Reliability Engineer
>
> From: Xavier Trilla 
> Date: Wednesday, July 24, 2019 at 1:25 PM
> To: Peter Eisch 
> Cc: "ceph-users@lists.ceph.com" 
> Subject: Re: [ceph-users] Upgrading and lost OSDs
>
> Hi Peter,
>
> I'm not sure, but maybe after some changes the OSDs are not being
> recognized by the ceph scripts.
>
> Ceph used to use udev to detect the OSDs and then moved to lvm. Which kind
> of OSDs are you running? Bluestore or filestore? Which version did you use
> to create them?
>
> Cheers!
>
> On 24 Jul 2019, at 20:04, Peter Eisch <peter.ei...@virginpulse.com> wrote:
> Hi,
>
> I’m working through updating from 12.2.12/luminous to 14.2.2/nautilus on
> CentOS 7.6. The managers are updated alright:
>
> # ceph -s
>   cluster:
> id: 2fdb5976-1234-4b29-ad9c-1ca74a9466ec
> health: HEALTH_WARN
> Degraded data redundancy: 24177/9555955 objects degraded
> (0.253%), 7 pgs degraded, 1285 pgs undersized
> 3 monitors have not enabled msgr2
>  ...
>
> I updated ceph on an OSD host with 'yum update' and then rebooted to grab
> the current kernel. Along the way, the contents of all the directories in
> /var/lib/ceph/osd/ceph-*/ were deleted. Thus I have 16 OSDs down from this.
> I can manage the undersized but I'd like to get these drives working again
> without deleting each OSD and recreating them.
>
> So far I've pulled the respective cephx key into the 'keyring' file and
> populated 'bluestore' into the 'type' files but I'm unsure how to get the
> lockboxes mounted to where I can get the OSDs running. The osd-lockbox
> directory is otherwise untouched from when the OSDs were deployed.
>
> Is there a way to run ceph-deploy or some other tool to rebuild the mounts
> for the drives?
>
> peter
> Peter Eisch
> Senior Site Reliability Engineer
>
>
>
>
> ___
> ceph-users mailing list
> 

Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch
Bluestore created with 12.2.10/luminous.

The OSD startup generates logs like:

2019-07-24 12:39:46.483 7f4b27649d80  0 set uid:gid to 167:167 (ceph:ceph)
2019-07-24 12:39:46.483 7f4b27649d80  0 ceph version 14.2.2 
(4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable), process ceph-osd, 
pid 48553
2019-07-24 12:39:46.483 7f4b27649d80  0 pidfile_write: ignore empty --pid-file
2019-07-24 12:39:46.483 7f4b27649d80  0 set uid:gid to 167:167 (ceph:ceph)
2019-07-24 12:39:46.483 7f4b27649d80  0 ceph version 14.2.2 
(4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable), process ceph-osd, 
pid 48553
2019-07-24 12:39:46.483 7f4b27649d80  0 pidfile_write: ignore empty --pid-file
2019-07-24 12:39:46.505 7f4b27649d80 -1 
bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open 
/var/lib/ceph/osd/ceph-0/block: (2) No such file or directory
2019-07-24 12:39:46.505 7f4b27649d80 -1  ** ERROR: unable to open OSD 
superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2019-07-24 12:39:46.505 7f4b27649d80 -1 
bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open 
/var/lib/ceph/osd/ceph-0/block: (2) No such file or directory
2019-07-24 12:39:46.505 7f4b27649d80 -1 
bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open 
/var/lib/ceph/osd/ceph-0/block: (2) No such file or directory
2019-07-24 12:39:46.505 7f4b27649d80 -1  ** ERROR: unable to open OSD 
superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2019-07-24 12:39:46.505 7f4b27649d80 -1  ** ERROR: unable to open OSD 
superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
-
# lsblk
NAMEMAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda   8:00   1.7T  0 disk
├─sda18:10   100M  0 part
├─sda28:20   1.7T  0 part
└─sda58:5010M  0 part
sdb   8:16   0   1.7T  0 disk
├─sdb18:17   0   100M  0 part
├─sdb28:18   0   1.7T  0 part
└─sdb58:21   010M  0 part
sdc   8:32   0   1.7T  0 disk
├─sdc18:33   0   100M  0 part
...
I'm thinking the OSD would start (I can recreate the .service definitions in 
systemctl) if the above were mounted in a way like they are on another of my 
hosts:
# lsblk
NAME MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda8:00   1.7T  0 disk
├─sda1 8:10   100M  0 part
│ └─97712be4-1234-4acc-8102-2265769053a5 253:17   098M  0 crypt 
/var/lib/ceph/osd/ceph-16
├─sda2 8:20   1.7T  0 part
│ └─049b7160-1234-4edd-a5dc-fe00faca8d89 253:16   0   1.7T  0 crypt
└─sda5 8:5010M  0 part  
/var/lib/ceph/osd-lockbox/97712be4-9674-4acc-1234-2265769053a5
sdb8:16   0   1.7T  0 disk
├─sdb1 8:17   0   100M  0 part
│ └─f03f0298-1234-42e9-8b28-f3016e44d1e2 253:26   098M  0 crypt 
/var/lib/ceph/osd/ceph-17
├─sdb2 8:18   0   1.7T  0 part
│ └─51177019-1234-4963-82d1-5006233f5ab2 253:30   0   1.7T  0 crypt
└─sdb5 8:21   010M  0 part  
/var/lib/ceph/osd-lockbox/f03f0298-1234-42e9-8b28-f3016e44d1e2
sdc8:32   0   1.7T  0 disk
├─sdc1 8:33   0   100M  0 part
│ └─0184df0c-1234-404d-92de-cb71b1047abf 253:8098M  0 crypt 
/var/lib/ceph/osd/ceph-18
├─sdc2 8:34   0   1.7T  0 part
│ └─fdad7618-1234-4021-a63e-40d973712e7b 253:13   0   1.7T  0 crypt
...

Thank you for your time on this,

peter



Peter Eisch
Senior Site Reliability Engineer
From: Xavier Trilla 
Date: Wednesday, July 24, 2019 at 1:25 PM
To: Peter Eisch 
Cc: "ceph-users@lists.ceph.com" 
Subject: Re: [ceph-users] Upgrading and lost OSDs

Hi Peter,

I'm not sure, but maybe after some changes the OSDs are not being recognized by the ceph scripts.

Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Paul Emmerich
Did you use ceph-disk before?

Support for ceph-disk was removed, see Nautilus upgrade instructions.
You'll need to run "ceph-volume simple scan" to convert them to ceph-volume

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 8:25 PM Xavier Trilla 
wrote:

> Hi Peter,
>
> I'm not sure, but maybe after some changes the OSDs are not being
> recognized by the ceph scripts.
>
> Ceph used to use udev to detect the OSDs and then moved to lvm. Which kind
> of OSDs are you running? Bluestore or filestore? Which version did you use
> to create them?
>
>
> Cheers!
>
> On 24 Jul 2019, at 20:04, Peter Eisch wrote:
>
> Hi,
>
> I’m working through updating from 12.2.12/luminous to 14.2.2/nautilus on
> CentOS 7.6. The managers are updated alright:
>
> # ceph -s
>   cluster:
> id: 2fdb5976-1234-4b29-ad9c-1ca74a9466ec
> health: HEALTH_WARN
> Degraded data redundancy: 24177/9555955 objects degraded
> (0.253%), 7 pgs degraded, 1285 pgs undersized
> 3 monitors have not enabled msgr2
>  ...
>
> I updated ceph on an OSD host with 'yum update' and then rebooted to grab
> the current kernel. Along the way, the contents of all the directories in
> /var/lib/ceph/osd/ceph-*/ were deleted. Thus I have 16 OSDs down from this.
> I can manage the undersized but I'd like to get these drives working again
> without deleting each OSD and recreating them.
>
> So far I've pulled the respective cephx key into the 'keyring' file and
> populated 'bluestore' into the 'type' files but I'm unsure how to get the
> lockboxes mounted to where I can get the OSDs running. The osd-lockbox
> directory is otherwise untouched from when the OSDs were deployed.
>
> Is there a way to run ceph-deploy or some other tool to rebuild the mounts
> for the drives?
>
> peter
>
> Peter Eisch​
> Senior Site Reliability Engineer
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Xavier Trilla
Hi Peter,

I'm not sure, but maybe after some changes the OSDs are not being recognized by 
the ceph scripts.

Ceph used to use udev to detect the OSDs and then moved to lvm. Which kind of 
OSDs are you running? Bluestore or filestore? Which version did you use to 
create them?


Cheers!

On 24 Jul 2019, at 20:04, Peter Eisch <peter.ei...@virginpulse.com> wrote:

Hi,

I'm working through updating from 12.2.12/luminous to 14.2.2/nautilus on 
CentOS 7.6. The managers are updated alright:

# ceph -s
  cluster:
id: 2fdb5976-1234-4b29-ad9c-1ca74a9466ec
health: HEALTH_WARN
Degraded data redundancy: 24177/9555955 objects degraded (0.253%), 
7 pgs degraded, 1285 pgs undersized
3 monitors have not enabled msgr2
 ...

I updated ceph on an OSD host with 'yum update' and then rebooted to grab the 
current kernel. Along the way, the contents of all the directories in 
/var/lib/ceph/osd/ceph-*/ were deleted. Thus I have 16 OSDs down from this. I 
can manage the undersized but I'd like to get these drives working again 
without deleting each OSD and recreating them.

So far I've pulled the respective cephx key into the 'keyring' file and 
populated 'bluestore' into the 'type' files but I'm unsure how to get the 
lockboxes mounted to where I can get the OSDs running. The osd-lockbox 
directory is otherwise untouched from when the OSDs were deployed.

Is there a way to run ceph-deploy or some other tool to rebuild the mounts for 
the drives?

peter

Peter Eisch
Senior Site Reliability Engineer

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Ch Wan
I usually add 20 OSDs each time.
To limit the impact of backfilling, I set
primary-affinity to 0 on those new OSDs and adjust the backfill
settings.
http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling
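
A hypothetical one-liner for the primary-affinity part (the OSD id range is
made up):

  for i in {100..119}; do ceph osd primary-affinity osd.$i 0; done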


Kevin Hrpcek wrote on Thu, Jul 25, 2019 at 2:02 AM:

> I change the crush weights. My 4 second sleep doesn't let peering finish
> for each one before continuing. I'd test with some small steps to get an
> idea of how much remaps when increasing the weight by $x. I've found my
> cluster is comfortable with +1 increases...also it takes a while to get to a
> weight of 11 if I did anything smaller.
>
> for i in {264..311}; do ceph osd crush reweight osd.${i} 11.0;sleep 4;done
>
> Kevin
>
> On 7/24/19 12:33 PM, Xavier Trilla wrote:
>
> Hi Kevin,
>
> Yeah, that makes a lot of sense, and looks even safer than adding OSDs one
> by one. What do you change, the crush weight? Or the reweight? (I guess you
> change the crush weight, I am right?)
>
> Thanks!
>
>
>
> On 24 Jul 2019, at 19:17, Kevin Hrpcek wrote:
>
> I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I
> do, you can obviously change the weight increase steps to what you are
> comfortable with. This has worked well for me and my workloads. I've
> sometimes seen peering take longer if I do steps too quickly but I don't
> run any mission critical has to be up 100% stuff and I usually don't notice
> if a pg takes a while to peer.
>
> Add all OSDs with an initial weight of 0. (nothing gets remapped)
> Ensure cluster is healthy.
> Use a for loop to increase weight on all new OSDs to 0.5 with a generous
> sleep between each for peering.
> Let the cluster balance and get healthy or close to healthy.
> Then repeat the previous 2 steps increasing weight by +0.5 or +1.0 until I
> am at the desired weight.
>
> Kevin
>
> On 7/24/19 11:44 AM, Xavier Trilla wrote:
>
> Hi,
>
>
>
> What would be the proper way to add 100 new OSDs to a cluster?
>
>
>
> I have to add 100 new OSDs to our actual > 300 OSDs cluster, and I would
> like to know how you do it.
>
>
>
> Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and
> it can handle plenty of load, but for the sake of safety -it hosts
> thousands of VMs via RBD- we usually add them one by one, waiting for a
> long time between adding each OSD.
>
>
>
> Obviously this leads to PLENTY of data movement, as each time the cluster
> geometry changes, data is migrated among all the OSDs. But with the kind of
> load we have, if we add several OSDs at the same time, some PGs can get
> stuck for a while, while they peer to the new OSDs.
>
>
>
> Now that I have to add > 100 new OSDs I was wondering if somebody has some
> suggestions.
>
>
>
> Thanks!
>
> Xavier.
>
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upgrading and lost OSDs

2019-07-24 Thread Peter Eisch
Hi,

I’m working through updating from 12.2.12/luminous to 14.2.2/nautilus on 
CentOS 7.6. The managers are updated alright:

# ceph -s
  cluster:
    id:     2fdb5976-1234-4b29-ad9c-1ca74a9466ec
    health: HEALTH_WARN
            Degraded data redundancy: 24177/9555955 objects degraded (0.253%), 
7 pgs degraded, 1285 pgs undersized
            3 monitors have not enabled msgr2
 ...

I updated ceph on an OSD host with 'yum update' and then rebooted to grab the 
current kernel.  Along the way, the contents of all the directories in 
/var/lib/ceph/osd/ceph-*/ were deleted.  Thus I have 16 OSDs down from this.  I 
can manage the undersized but I'd like to get these drives working again 
without deleting each OSD and recreating them.

So far I've pulled the respective cephx key into the 'keyring' file and 
populated 'bluestore' into the 'type' files but I'm unsure how to get the 
lockboxes mounted to where I can get the OSDs running.  The osd-lockbox 
directory is otherwise untouched from when the OSDs were deployed.

Is there a way to run ceph-deploy or some other tool to rebuild the mounts for 
the drives?

peter


Peter Eisch
Senior Site Reliability Engineer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kevin Hrpcek
I change the crush weights. My 4 second sleep doesn't let peering finish for 
each one before continuing. I'd test with some small steps to get an idea of 
how much remaps when increasing the weight by $x. I've found my cluster is 
comfortable with +1 increases...also it takes a while to get to a weight of 11 if 
I did anything smaller.

for i in {264..311}; do ceph osd crush reweight osd.${i} 11.0;sleep 4;done
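
For the "initial weight of 0" step in the procedure quoted further down, one way
to get that behaviour (an assumption on my part; set it before creating the new
OSDs) would be:

  ceph config set osd osd_crush_initial_weight 0   # or: osd crush initial weight = 0 in ceph.conf [osd]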

Kevin

On 7/24/19 12:33 PM, Xavier Trilla wrote:
Hi Kevin,

Yeah, that makes a lot of sense, and looks even safer than adding OSDs one by 
one. What do you change, the crush weight? Or the reweight? (I guess you change 
the crush weight, I am right?)

Thanks!



On 24 Jul 2019, at 19:17, Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:

I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do, 
you can obviously change the weight increase steps to what you are comfortable 
with. This has worked well for me and my workloads. I've sometimes seen peering 
take longer if I do steps too quickly but I don't run any mission critical has 
to be up 100% stuff and I usually don't notice if a pg takes a while to peer.

Add all OSDs with an initial weight of 0. (nothing gets remapped)
Ensure cluster is healthy.
Use a for loop to increase weight on all new OSDs to 0.5 with a generous sleep 
between each for peering.
Let the cluster balance and get healthy or close to healthy.
Then repeat the previous 2 steps increasing weight by +0.5 or +1.0 until I am 
at the desired weight.

Kevin

On 7/24/19 11:44 AM, Xavier Trilla wrote:
Hi,

What would be the proper way to add 100 new OSDs to a cluster?

I have to add 100 new OSDs to our actual > 300 OSDs cluster, and I would like 
to know how you do it.

Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and it 
can handle plenty of load, but for the sake of safety -it hosts thousands of 
VMs via RBD- we usually add them one by one, waiting for a long time between 
adding each OSD.

Obviously this leads to PLENTY of data movement, as each time the cluster 
geometry changes, data is migrated among all the OSDs. But with the kind of 
load we have, if we add several OSDs at the same time, some PGs can get stuck 
for a while, while they peer to the new OSDs.

Now that I have to add > 100 new OSDs I was wondering if somebody has some 
suggestions.

Thanks!
Xavier.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ang: How to add 100 new OSDs...

2019-07-24 Thread da...@oderland.se
CERN has a pretty nice reweight script that we run when we add OSDs in production:
https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight
Might be of help!

Kind regards,
David Majchrzak

 Original message 
Subject: Re: [ceph-users] How to add 100 new OSDs...
From: Xavier Trilla
To: Kevin Hrpcek
Cc: ceph-users@lists.ceph.com
Hi Kevin,


Yeah, that makes a lot of sense, and looks even safer than adding OSDs one by one. What do you change, the crush weight? Or the reweight? (I guess you change the crush weight, I am right?)


Thanks!




On 24 Jul 2019, at 19:17, Kevin Hrpcek wrote:



I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do, you can obviously change the weight increase steps to what you are comfortable with. This has worked well for me and my workloads. I've sometimes seen
 peering take longer if I do steps too quickly but I don't run any mission critical has to be up 100% stuff and I usually don't notice if a pg takes a while to peer.

Add all OSDs with an initial weight of 0. (nothing gets remapped)
Ensure cluster is healthy.
Use a for loop to increase weight on all new OSDs to 0.5 with a generous sleep between each for peering.
Let the cluster balance and get healthy or close to healthy.
Then repeat the previous 2 steps increasing weight by +0.5 or +1.0 until I am at the desired weight.

Kevin

On 7/24/19 11:44 AM, Xavier Trilla wrote:





Hi,
 
What would be the proper way to add 100 new OSDs to a cluster?
 
I have to add 100 new OSDs to our actual > 300 OSDs cluster, and I would like to know how you do it.
 
Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and it can handle plenty of load, but for the sake of safety -it hosts thousands of VMs via RBD- we usually add them one by one, waiting for a long
 time between adding each OSD.
 
Obviously this leads to PLENTY of data movement, as each time the cluster geometry changes, data is migrated among all the OSDs. But with the kind of load we have, if we add several OSDs at the same time, some PGs can
 get stuck for a while, while they peer to the new OSDs.
 
Now that I have to add > 100 new OSDs I was wondering if somebody has some suggestions.
 
Thanks!
Xavier.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph durability during outages

2019-07-24 Thread Nathan Fish
2/3 monitors is enough to maintain quorum, as is any majority.

However, EC pools have a default min_size of k+1 chunks.
This can be adjusted to k, but that has its own dangers.
I assume you are using failure domain = "host"?
As you had k=6,m=2, and lost 2 failure domains, you had k chunks left,
resulting in all IO stopping.

Currently, EC pools that have k chunks but less than min_size do not rebuild.
This is being worked on for Octopus: https://github.com/ceph/ceph/pull/17619

k=6,m=2 is therefore somewhat slim for a 10-host cluster.
I do not currently use EC, as I have only 3 failure domains, so others
here may know better than me,
but I might have done k=6,m=3. This would allow rebuilding to OK from
1 host failure, and remaining available in WARN state with 2 hosts
down.
k=4,m=4 would be very safe, but potentially too expensive.
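
A hypothetical profile along those lines (profile and pool names are made up,
and the PG count depends on the cluster):

  ceph osd erasure-code-profile set ec-6-3 k=6 m=3 crush-failure-domain=host
  ceph osd pool create ecpool 256 256 erasure ec-6-3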


On Wed, Jul 24, 2019 at 1:31 PM Jean-Philippe Méthot
 wrote:
>
> Hi,
>
> I’m running in production a 3-monitor, 10-OSD-node Ceph cluster. This 
> cluster is used to host OpenStack VM RBDs. My pools are set to use a k=6 m=2 
> erasure code profile with a 3 copy metadata pool in front. The cluster runs 
> well, but we recently had a short outage which triggered unexpected behaviour 
> in the cluster.
>
> I’ve always been under the impression that Ceph would continue working 
> properly even if nodes would go down. I tested it several months ago with 
> this configuration and it worked fine as long as only 2 nodes went down. 
> However, this time, the first monitor as well as two osd nodes went down. As 
> a result, Openstack VMs were able to mount their rbd volume but unable to 
> read from it, even after the cluster had recovered with the following message 
> : Reduced data availability: 599 pgs inactive, 599 pgs incomplete .
>
> I believe the cluster should have continued to work properly despite the 
> outage, so what could have prevented it from functioning? Is it because there 
> was only two monitors remaining? Or is it that reduced data availability 
> message? In that case, is my erasure coding configuration fine for that 
> number of nodes?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Xavier Trilla
Hi Kevin,

Yeah, that makes a lot of sense, and looks even safer than adding OSDs one by 
one. What do you change, the crush weight? Or the reweight? (I guess you change 
the crush weight, I am right?)

Thanks!



On 24 Jul 2019, at 19:17, Kevin Hrpcek <kevin.hrp...@ssec.wisc.edu> wrote:

I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do, 
you can obviously change the weight increase steps to what you are comfortable 
with. This has worked well for me and my workloads. I've sometimes seen peering 
take longer if I do steps too quickly but I don't run any mission critical has 
to be up 100% stuff and I usually don't notice if a pg takes a while to peer.

Add all OSDs with an initial weight of 0. (nothing gets remapped)
Ensure cluster is healthy.
Use a for loop to increase weight on all new OSDs to 0.5 with a generous sleep 
between each for peering.
Let the cluster balance and get healthy or close to healthy.
Then repeat the previous 2 steps increasing weight by +0.5 or +1.0 until I am 
at the desired weight.

Kevin

On 7/24/19 11:44 AM, Xavier Trilla wrote:
Hi,

What would be the proper way to add 100 new OSDs to a cluster?

I have to add 100 new OSDs to our actual > 300 OSDs cluster, and I would like 
to know how you do it.

Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and it 
can handle plenty of load, but for the sake of safety -it hosts thousands of 
VMs via RBD- we usually add them one by one, waiting for a long time between 
adding each OSD.

Obviously this leads to PLENTY of data movement, as each time the cluster 
geometry changes, data is migrated among all the OSDs. But with the kind of 
load we have, if we add several OSDs at the same time, some PGs can get stuck 
for a while, while they peer to the new OSDs.

Now that I have to add > 100 new OSDs I was wondering if somebody has some 
suggestions.

Thanks!
Xavier.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-24 Thread Jason Dillaman
On Wed, Jul 24, 2019 at 12:47 PM Marc Schöchlin  wrote:
>
> Hi Jason,
>
> I installed kernel 4.4.0-154.181 (from Ubuntu package sources) and performed 
> the crash reproduction.
> The problem also re-appeared with that kernel release.
>
> A run with 10 parallel gunzip processes threw 1600 write and 330 read IOPS 
> against the cluster / the rbd_ec volume, with a transfer rate of 290 MB/sec for 
> 10 minutes.
> After that the same problem re-appeared.
>
> What should we do now?
>
> Testing with a 12.2.5 librbd/rbd-nbd is currently not that easy for me, 
> because the Ceph apt source does not contain that version.
> Do you know a package source?

All the upstream packages should be available here [1], including 12.2.5.

> How can i support you?

Did you pull the OSD blocked ops stats to figure out what is going on
with the OSDs?
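
For reference, these can be pulled through the admin socket on the host that
carries the implicated OSD; osd.7 below is just a placeholder:

$ ceph daemon osd.7 dump_ops_in_flight
$ ceph daemon osd.7 dump_blocked_ops
$ ceph daemon osd.7 dump_historic_slow_ops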

> Regards
> Marc
>
> Am 24.07.19 um 07:55 schrieb Marc Schöchlin:
> > Hi Jason,
> >
> > Am 24.07.19 um 00:40 schrieb Jason Dillaman:
> >>> Sure, which kernel do you prefer?
> >> You said you have never had an issue w/ rbd-nbd 12.2.5 in your Xen 
> >> environment. Can you use a matching kernel version?
> >
> > That's true, the virtual machines in our Xen environments run completely on 
> > rbd-nbd devices.
> > Every host runs dozens of rbd-nbd maps which are visible as Xen disks in 
> > the virtual systems.
> > (https://github.com/vico-research-and-consulting/RBDSR)
> >
> > It seems that XenServer has a special behavior regarding device timings, because 
> > 1.5 years ago we had an outage of 1.5 hours of our Ceph cluster which 
> > blocked all write requests
> > (overfull disks because of huge usage growth). In this situation all 
> > virtual machines continued their work without problems after the cluster was 
> > back.
> > We haven't set any timeouts using nbd_set_timeout.c on these systems.
> >
> > We never experienced problems with these rbd-nbd instances.
> >
> > [root@xen-s31 ~]# rbd nbd ls
> > pid   pool   image  
> >   snap device
> > 10405 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-72f4e61d-acb9-4679-9b1d-fe0324cb5436 -/dev/nbd3
> > 12731 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> > RBD-88f8889a-05dc-49ab-a7de-8b5f3961f9c9 -/dev/nbd4
> > 13123 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-37243066-54b0-453a-8bf3-b958153a680d -/dev/nbd5
> > 15342 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 
> > RBD-2bee9bf7-4fed-4735-a749-2d4874181686 -/dev/nbd6
> > 15702 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-5b93eb93-ebe7-4711-a16a-7893d24c1bbf -/dev/nbd7
> > 27568 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-616a74b5-3f57-4123-9505-dbd4c9aa9be3 -/dev/nbd8
> > 21112 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-5c673a73-7827-44cc-802c-8d626da2f401 -/dev/nbd9
> > 15726 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-1069a275-d97f-48fd-9c52-aed1d8ac9eab -/dev/nbd10
> > 4368  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> > RBD-23b72184-0914-4924-8f7f-10868af7c0ab -/dev/nbd11
> > 4642  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-bf13cf77-6115-466e-85c5-aa1d69a570a0 -/dev/nbd12
> > 9438  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-a2071aa0-5f63-4425-9f67-1713851fc1ca -/dev/nbd13
> > 29191 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-fd9a299f-dad9-4ab9-b6c9-2e9650cda581 -/dev/nbd14
> > 4493  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> > RBD-1bbb4135-e9ed-4720-a41a-a49b998faf42 -/dev/nbd15
> > 4683  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-374cadac-d969-49eb-8269-aa125cba82d8 -/dev/nbd16
> > 1736  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-478a20cc-58dd-4cd9-b8b1-6198014e21b1 -/dev/nbd17
> > 3648  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-6e28ec15-747a-43c9-998d-e9f2a600f266 -/dev/nbd18
> > 9993  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> > RBD-61ae5ef3-9efb-4fe6-8882-45d54558313e -/dev/nbd19
> > 10324 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > RBD-f7d27673-c268-47b9-bd58-46dcd4626bbb -/dev/nbd20
> > 19330 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-0d4e5568-ac93-4f27-b24f-6624f2fa4a2b -/dev/nbd21
> > 14942 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 
> > RBD-69832522-fd68-49f9-810f-485947ff5e44 -/dev/nbd22
> > 20859 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> > RBD-5025b066-723e-48f5-bc4e-9b8bdc1e9326 -/dev/nbd23
> > 19247 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> > 

Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Kevin Hrpcek
I often add 50+ OSDs at a time and my cluster is all NLSAS. Here is what I do; 
you can obviously change the weight increase steps to whatever you are comfortable 
with. This has worked well for me and my workloads. I've sometimes seen peering 
take longer if I do the steps too quickly, but I don't run any mission-critical, 
has-to-be-up-100% workloads, and I usually don't notice if a PG takes a while to peer.

Add all OSDs with an initial weight of 0. (nothing gets remapped)
Ensure cluster is healthy.
Use a for loop to increase the weight on all new OSDs to 0.5, with a generous sleep 
between each step for peering (a sketch of such a loop is included after these steps).
Let the cluster balance and get healthy or close to healthy.
Then repeat the previous 2 steps increasing weight by +0.5 or +1.0 until I am 
at the desired weight.
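
A minimal sketch of that loop, assuming the new OSDs are osd.300 through
osd.399, that it is the crush weight being stepped, and a ten-minute pause
between OSDs; adjust the IDs, step size and sleep to your environment:

# step the crush weight of each new OSD, then give the cluster time to peer
for i in $(seq 300 399); do
    ceph osd crush reweight osd.$i 0.5
    sleep 600
done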

Kevin

On 7/24/19 11:44 AM, Xavier Trilla wrote:
Hi,

What would be the proper way to add 100 new OSDs to a cluster?

I have to add 100 new OSDs to our current cluster of more than 300 OSDs, and I 
would like to know how you do it.

Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and it 
can handle plenty of load, but for the sake of safety -it hosts thousands of 
VMs via RBD- we usually add them one by one, waiting for a long time between 
adding each OSD.

Obviously this leads to PLENTY of data movement, as each time the cluster 
geometry changes, data is migrated among all the OSDs. But with the kind of 
load we have, if we add several OSDs at the same time, some PGs can get stuck 
for a while, while they peer to the new OSDs.

Now that I have to add > 100 new OSDs I was wondering if somebody has some 
suggestions.

Thanks!
Xavier.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-24 Thread Marc Schöchlin
Hi Jason,

I installed kernel 4.4.0-154.181 (from Ubuntu package sources) and performed 
the crash reproduction.
The problem also re-appeared with that kernel release.

A run with 10 parallel gunzip processes threw 1600 write and 330 read IOPS against 
the cluster / the rbd_ec volume, with a transfer rate of 290 MB/sec for 10 minutes.
After that the same problem re-appeared.

What should we do now?

Testing with a 12.2.5 librbd/rbd-nbd is currently not that easy for me, 
because the Ceph apt source does not contain that version.
Do you know a package source?

How can i support you?

Regards
Marc

Am 24.07.19 um 07:55 schrieb Marc Schöchlin:
> Hi Jason,
>
> Am 24.07.19 um 00:40 schrieb Jason Dillaman:
>>> Sure, which kernel do you prefer?
>> You said you have never had an issue w/ rbd-nbd 12.2.5 in your Xen 
>> environment. Can you use a matching kernel version? 
>
> That's true, the virtual machines in our Xen environments run completely on 
> rbd-nbd devices.
> Every host runs dozens of rbd-nbd maps which are visible as Xen disks in the 
> virtual systems.
> (https://github.com/vico-research-and-consulting/RBDSR)
>
> It seems that XenServer has a special behavior regarding device timings, because 
> 1.5 years ago we had an outage of 1.5 hours of our Ceph cluster which blocked 
> all write requests
> (overfull disks because of huge usage growth). In this situation all 
> virtual machines continued their work without problems after the cluster was 
> back.
> We haven't set any timeouts using nbd_set_timeout.c on these systems.
>
> We never experienced problems with these rbd-nbd instances.
>
> [root@xen-s31 ~]# rbd nbd ls
> pid   pool   image
>     snap device
> 10405 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-72f4e61d-acb9-4679-9b1d-fe0324cb5436 -    /dev/nbd3 
> 12731 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> RBD-88f8889a-05dc-49ab-a7de-8b5f3961f9c9 -    /dev/nbd4 
> 13123 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-37243066-54b0-453a-8bf3-b958153a680d -    /dev/nbd5 
> 15342 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 
> RBD-2bee9bf7-4fed-4735-a749-2d4874181686 -    /dev/nbd6 
> 15702 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-5b93eb93-ebe7-4711-a16a-7893d24c1bbf -    /dev/nbd7 
> 27568 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-616a74b5-3f57-4123-9505-dbd4c9aa9be3 -    /dev/nbd8 
> 21112 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-5c673a73-7827-44cc-802c-8d626da2f401 -    /dev/nbd9 
> 15726 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-1069a275-d97f-48fd-9c52-aed1d8ac9eab -    /dev/nbd10
> 4368  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> RBD-23b72184-0914-4924-8f7f-10868af7c0ab -    /dev/nbd11
> 4642  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-bf13cf77-6115-466e-85c5-aa1d69a570a0 -    /dev/nbd12
> 9438  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-a2071aa0-5f63-4425-9f67-1713851fc1ca -    /dev/nbd13
> 29191 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-fd9a299f-dad9-4ab9-b6c9-2e9650cda581 -    /dev/nbd14
> 4493  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> RBD-1bbb4135-e9ed-4720-a41a-a49b998faf42 -    /dev/nbd15
> 4683  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-374cadac-d969-49eb-8269-aa125cba82d8 -    /dev/nbd16
> 1736  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-478a20cc-58dd-4cd9-b8b1-6198014e21b1 -    /dev/nbd17
> 3648  RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-6e28ec15-747a-43c9-998d-e9f2a600f266 -    /dev/nbd18
> 9993  RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> RBD-61ae5ef3-9efb-4fe6-8882-45d54558313e -    /dev/nbd19
> 10324 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-f7d27673-c268-47b9-bd58-46dcd4626bbb -    /dev/nbd20
> 19330 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-0d4e5568-ac93-4f27-b24f-6624f2fa4a2b -    /dev/nbd21
> 14942 RBD_XenStorage-PROD-SSD-1-cb933ab7-a006-4046-a012-5cbe0c5fbfb5 
> RBD-69832522-fd68-49f9-810f-485947ff5e44 -    /dev/nbd22
> 20859 RBD_XenStorage-PROD-HDD-2-08fdb4aa-81e3-433a-87d7-d5b37012a282 
> RBD-5025b066-723e-48f5-bc4e-9b8bdc1e9326 -    /dev/nbd23
> 19247 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-095292a0-6cc2-4112-95bf-15cb3dd33e9a -    /dev/nbd24
> 22356 RBD_XenStorage-PROD-SSD-2-edcf45e6-ca5b-43f9-bafe-c553b1e5dd84 
> RBD-f8229ea0-ad7b-4034-9cbe-7353792a2b7c -    /dev/nbd25
> 22537 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> RBD-e8c0b841-50ec-4765-a3cb-30c78a4b9162 -    /dev/nbd26
> 15105 RBD_XenStorage-PROD-HDD-1-2d80bec4-0f74-4553-9d87-5ccf650c87a0 
> 

[ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Xavier Trilla
Hi,

What would be the proper way to add 100 new OSDs to a cluster?

I have to add 100 new OSDs to our current cluster of more than 300 OSDs, and I 
would like to know how you do it.

Usually, we add them quite slowly. Our cluster is a pure SSD/NVMe one, and it 
can handle plenty of load, but for the sake of safety -it hosts thousands of 
VMs via RBD- we usually add them one by one, waiting for a long time between 
adding each OSD.

Obviously this leads to PLENTY of data movement, as each time the cluster 
geometry changes, data is migrated among all the OSDs. But with the kind of 
load we have, if we add several OSDs at the same time, some PGs can get stuck 
for a while, while they peer to the new OSDs.

Now that I have to add > 100 new OSDs I was wondering if somebody has some 
suggestions.

Thanks!
Xavier.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to deal with slow requests related to OSD bugs

2019-07-24 Thread Xavier Trilla
Hi,

We had a strange issue while adding a new OSD to our Ceph Luminous 12.2.8 
cluster. Our cluster has more than 300 OSDs based on SSDs and NVMe.

After adding a new OSD to the Ceph cluster one of the already running OSDs 
started to give us slow queries warnings.

We checked the OSD and it was working properly: nothing strange in the logs, and 
it also showed disk activity. It looks like it stopped serving requests for just 
one PG.

Requests were just piling up, and the number of slow requests kept growing 
constantly until we restarted the OSD (all our OSDs are running BlueStore).

We've been checking out everything in our setup, and everything is properly 
configured (This cluster has been running for >5 years and it hosts several 
thousand VMs.)

Beyond finding the real source of the issue - I guess I'll have to add more OSDs, 
and if it happens again I can just dump the stats of the OSD (ceph daemon 
osd.X dump_historic_slow_ops) - what I would like to find is a way to protect 
the cluster from this kind of issue.

I mean, in some scenarios OSDs just suicide - actually, I fixed this issue simply by 
restarting the offending OSD - but how can we deal with this kind of situation? 
I've been looking around, but I could not find anything (obviously we could 
set our monitoring software to restart any OSD which has more than N slow 
requests, but I find that a little bit too aggressive).
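
(For completeness, the kind of watchdog I mean would look roughly like the
sketch below; the threshold, the socket glob and the JSON field layout are
assumptions that depend on the Ceph release and packaging, so treat it as an
illustration of the idea rather than something to run as-is.)

# restart any OSD on this host with more than 50 currently blocked ops
for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=${sock##*ceph-osd.}; id=${id%.asok}
    blocked=$(ceph daemon "osd.$id" dump_blocked_ops 2>/dev/null | jq '.ops | length')
    if [ "${blocked:-0}" -gt 50 ]; then
        echo "osd.$id has $blocked blocked ops, restarting it"
        systemctl restart "ceph-osd@$id"
    fi
done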

Is there anything built into Ceph to deal with these situations? An OSD blocking 
requests in an RBD scenario is a big deal, as plenty of VMs will get disk 
timeouts, which can lead to the VM just panicking.

Thanks!
Xavier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph durability during outages

2019-07-24 Thread Jean-Philippe Méthot
Hi,

I'm running a production Ceph cluster with 3 monitors and 10 OSD nodes. This 
cluster is used to host OpenStack VM RBD volumes. My pools are set to use a k=6 
m=2 erasure code profile with a 3-copy metadata pool in front. The cluster runs 
well, but we recently had a short outage which triggered unexpected behaviour in 
the cluster.

I've always been under the impression that Ceph would continue working properly 
even if nodes went down. I tested this configuration several months ago and it 
worked fine as long as only 2 nodes went down. However, this time the first 
monitor as well as two OSD nodes went down. As a result, OpenStack VMs were able 
to mount their RBD volumes but were unable to read from them, even after the 
cluster had recovered, with the following message: Reduced data 
availability: 599 pgs inactive, 599 pgs incomplete.

I believe the cluster should have continued to work properly despite the 
outage, so what could have prevented it from functioning? Is it because there 
were only two monitors remaining? Or is it related to that reduced data availability 
message? In that case, is my erasure coding configuration suitable for that number 
of nodes?
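
If it matters: my understanding is that EC pools default min_size to k+1, so
with k=6 m=2 and a host failure domain each PG would need 7 healthy shards,
and losing two hosts at once could leave some PGs below that. Is that what
made them inactive? These are the settings I would check on my side (profile,
pool and PG id are placeholders):

$ ceph osd erasure-code-profile get <profile-name>   # shows k, m and crush-failure-domain
$ ceph osd pool get <data-pool> min_size
$ ceph pg <pg-id> query                              # recovery_state explains why a PG is incomplete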

Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reproducable rbd-nbd crashes

2019-07-24 Thread Mike Christie
On 07/23/2019 12:28 AM, Marc Schöchlin wrote:
>>> For testing purposes I set the timeout to unlimited ("nbd_set_ioctl 
>>> /dev/nbd0 0", on an already mounted device).
>>> I re-executed the problem procedure and discovered that the 
>>> compression procedure crashes not at the same file, but 30 
>>> seconds later, with the same crash behavior.
>>>
>> 0 will cause the default timeout of 30 secs to be used.
> Okay, then the usage description at 
> https://github.com/OnApp/nbd-kernel_mod/blob/master/nbd_set_timeout.c does 
> not seem to be correct :-)

It is correct for older kernels:

With older kernels you could turn it off by setting it to 0, and it was
off by default.

With newer kernels, it's on by default, and there is no way to turn it off.

So with older kernels, you could have been hitting similar slowdowns,
but you would never have seen the timeouts, IO errors, etc.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions regarding backing up Ceph

2019-07-24 Thread Sinan Polat
Hi,

Why not use backup tools that can do native OpenStack backups?

We are also using Ceph as the cinder backend on our OpenStack platform. We use 
CommVault to make our backups.

- Sinan

> Op 24 jul. 2019 om 17:48 heeft Wido den Hollander  het 
> volgende geschreven:
> 
> 
> 
>> On 7/24/19 4:06 PM, Fabian Niepelt wrote:
>> Hi, thanks for the reply.
>> 
>> Am Mittwoch, den 24.07.2019, 15:26 +0200 schrieb Wido den Hollander:
>>> 
>>> On 7/24/19 1:37 PM, Fabian Niepelt wrote:
 Hello ceph-users,
 
 I am currently building a Ceph cluster that will serve as a backend for
 Openstack and object storage using RGW. The cluster itself is finished and
 integrated with Openstack and virtual machines for testing are being
 deployed.
 Now I'm a bit stumped on how to effectively backup the Ceph pools.
 My requirements are two weekly backups, of which one must be offline after
 finishing backing up (systems turned powerless). We are expecting about
 250TB to
 500TB of data for now. The backups must protect against accidental pool
 deletion/corruption or widespread infection of a cryptovirus. In short:
 Complete
 data loss in the production Ceph cluster.
 
 At the moment, I am facing two issues:
 
 1. For the cinder pool, I looked into creating snapshots using the ceph CLI
 (so
 they don't turn up in Openstack and cannot be accidentally deleted by 
 users)
 and
 exporting their diffs. But volumes with snapshots created this way cannot 
 be
 removed from Openstack. Does anyone have an idea how to do this better?
>>> 
>>> You mean that while you leave the snapshot there OpenStack can't remove it?
>> 
>> Yes, that is correct. cinder-volume cannot remove a volume that still has a
>> snapshot. If the snapshot is created by openstack, it will remove the 
>> snapshot
>> before removing the volume. But snapshotting directly from ceph will forego
>> Openstack so it will never know about that snapshot's existence.
>> 
> 
> Ah, yes. That means you would need to remove it manually.
> 
 Alternatively, I could do a full export each week, but I am not sure if 
 that
 would be fast enough..
 
>>> 
>>> It probably won't, but the full backup is still the safest way imho.
>>> However: Does this scale?
>>> 
>>> You can export multiple RBD images in parallel and store them somewhere
>>> else, but it will still take a long time.
>>> 
>>> The export needs to be stored somewhere and then picked up. Or you could
>>> use some magic with Netcat to stream the RBD export to a destination host.
>>> 
>> 
>> Scaling is also my biggest worry about this.
>> 
 2. My search so far has only turned up backing up RBD pools, but how could 
 I
 backup the pools that are used for object storage?
 
>>> 
>>> Not easily. I think you mean RGW? You could try the RGW MultiSite, but
>>> it's difficult.
>>> 
>>> A complete DR with Ceph to restore it back to how it was at a given
>>> point in time is a challenge.
>>> 
>> 
>> Yes, I would like to backup the pools used by the RGW.
> 
> Not really an option. You would need to use the RGW MultiSite to
> replicate all data to a second environment.
> 
>> 
 Of course, I'm also open to completely other ideas on how to backup Ceph 
 and
 would appreciate hearing how you people are doing your backups.
>>> 
>>> A lot of time the backups are created inside the VMs on File level. And
>>> there is a second OpenStack+Ceph system which runs a mirror of the VMs
>>> or application. If one burns down it's not the end of the world.
>>> 
>>> Trying to backup a Ceph cluster sounds very 'enterprise' and is
>>> difficult to scale as well.
>>> 
>> 
>> Are those backups saved in Ceph as well? I cannot solely rely on Ceph 
>> because we
>> want to protect ourselves against failures in Ceph or a human accidentally or
>> maliciously deletes all pools.
>> 
> 
> That second OpenStack+Ceph environment is completely different. All the
> VMs are set up twice and using replication and backups on application
> level such things are redundant.
> 
> Think about MySQL replication for example.
> 
>> From what I'm reading, it seems to be better to maybe implement a backup
>> solution outside of Ceph that our Openstack users can use and not deal with
>> backing up Ceph at all, except its configs to get it running after total
>> desaster...
>> 
> 
> You could backup OpenStack's MySQL database, the ceph config and then
> backup the data inside the VMs.
> 
> It's very difficult to backup data for DR to a certain point of time
> when you go into the >100TB scale.
> 
> Wido
> 
>>> Wido
>>> 
 Any help is much appreciated.
 
 Greetings
 Fabian
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
> ___
> ceph-users mailing list
> 

Re: [ceph-users] Questions regarding backing up Ceph

2019-07-24 Thread Wido den Hollander



On 7/24/19 4:06 PM, Fabian Niepelt wrote:
> Hi, thanks for the reply.
> 
> Am Mittwoch, den 24.07.2019, 15:26 +0200 schrieb Wido den Hollander:
>>
>> On 7/24/19 1:37 PM, Fabian Niepelt wrote:
>>> Hello ceph-users,
>>>
>>> I am currently building a Ceph cluster that will serve as a backend for
>>> Openstack and object storage using RGW. The cluster itself is finished and
>>> integrated with Openstack and virtual machines for testing are being
>>> deployed.
>>> Now I'm a bit stumped on how to effectively backup the Ceph pools.
>>> My requirements are two weekly backups, of which one must be offline after
>>> finishing backing up (systems turned powerless). We are expecting about
>>> 250TB to
>>> 500TB of data for now. The backups must protect against accidental pool
>>> deletion/corruption or widespread infection of a cryptovirus. In short:
>>> Complete
>>> data loss in the production Ceph cluster.
>>>
>>> At the moment, I am facing two issues:
>>>
>>> 1. For the cinder pool, I looked into creating snapshots using the ceph CLI
>>> (so
>>> they don't turn up in Openstack and cannot be accidentally deleted by users)
>>> and
>>> exporting their diffs. But volumes with snapshots created this way cannot be
>>> removed from Openstack. Does anyone have an idea how to do this better?
>>
>> You mean that while you leave the snapshot there OpenStack can't remove it?
> 
> Yes, that is correct. cinder-volume cannot remove a volume that still has a
> snapshot. If the snapshot is created by openstack, it will remove the snapshot
> before removing the volume. But snapshotting directly from ceph will forego
> Openstack so it will never know about that snapshot's existence.
> 

Ah, yes. That means you would need to remove it manually.

>>> Alternatively, I could do a full export each week, but I am not sure if that
>>> would be fast enough..
>>>
>>
>> It probably won't, but the full backup is still the safest way imho.
>> However: Does this scale?
>>
>> You can export multiple RBD images in parallel and store them somewhere
>> else, but it will still take a long time.
>>
>> The export needs to be stored somewhere and then picked up. Or you could
>> use some magic with Netcat to stream the RBD export to a destination host.
>>
> 
> Scaling is also my biggest worry about this.
> 
>>> 2. My search so far has only turned up backing up RBD pools, but how could I
>>> backup the pools that are used for object storage?
>>>
>>
>> Not easily. I think you mean RGW? You could try the RGW MultiSite, but
>> it's difficult.
>>
>> A complete DR with Ceph to restore it back to how it was at a given
>> point in time is a challenge.
>>
> 
> Yes, I would like to backup the pools used by the RGW.

Not really an option. You would need to use the RGW MultiSite to
replicate all data to a second environment.

> 
>>> Of course, I'm also open to completely other ideas on how to backup Ceph and
>>> would appreciate hearing how you people are doing your backups.
>>
>> A lot of time the backups are created inside the VMs on File level. And
>> there is a second OpenStack+Ceph system which runs a mirror of the VMs
>> or application. If one burns down it's not the end of the world.
>>
>> Trying to backup a Ceph cluster sounds very 'enterprise' and is
>> difficult to scale as well.
>>
> 
> Are those backups saved in Ceph as well? I cannot solely rely on Ceph because 
> we
> want to protect ourselves against failures in Ceph or a human accidentally or
> maliciously deletes all pools.
> 

That second OpenStack+Ceph environment is completely separate. All the
VMs are set up twice, and by using replication and backups at the application
level, such things are redundant.

Think about MySQL replication for example.

> From what I'm reading, it seems to be better to maybe implement a backup
> solution outside of Ceph that our Openstack users can use and not deal with
> backing up Ceph at all, except its configs to get it running after total
> desaster...
> 

You could backup OpenStack's MySQL database, the ceph config and then
backup the data inside the VMs.
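
As a very rough sketch of that minimal approach (file names and paths are
placeholders, and ceph config dump needs Mimic or newer):

$ mysqldump --single-transaction --all-databases | gzip > openstack-db-$(date +%F).sql.gz
$ tar czf ceph-etc-$(date +%F).tar.gz /etc/ceph
$ ceph config dump > ceph-config-$(date +%F).txt
$ ceph osd getcrushmap -o crushmap-$(date +%F).bin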

It's very difficult to back up data for DR to a certain point in time
when you get into the >100TB scale.

Wido

>> Wido
>>
>>> Any help is much appreciated.
>>>
>>> Greetings
>>> Fabian
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

2019-07-24 Thread Janek Bevendorff



So it looks like the problem only occurs with the kernel module, but 
maybe ceph-fuse is just too slow to tell. In fact, it is an order of magnitude 
slower: I only get 1.3k reqs/s compared to the 20k reqs/s with the 
kernel module, which is not practical at all.


Update 2: it does indeed seem like ceph-fuse is just too slow. After 
continuously copying to the ceph-fuse mount for some more time, the 
inode number has slowly started crawling upwards again, now at 1M.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions regarding backing up Ceph

2019-07-24 Thread Paul Emmerich
Note that enabling RBD mirroring means taking a hit on IOPS performance;
just think of it as roughly a 2x overhead, mainly on IOPS.
But it does work very well for disaster recovery scenarios if you can take
the performance hit.
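
For reference, journal-based mirroring (the mode that causes the extra write)
is enabled roughly like this; pool and image names are placeholders, and an
rbd-mirror daemon has to run on the remote site:

$ rbd mirror pool enable volumes image
$ rbd feature enable volumes/volume-1234 journaling
$ rbd mirror image enable volumes/volume-1234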


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Wed, Jul 24, 2019 at 4:18 PM Tobias Gall 
wrote:

> Hallo,
>
> what about RGW Replication:
>
> https://ceph.com/geen-categorie/radosgw-simple-replication-example/
> http://docs.ceph.com/docs/master/radosgw/multisite/
>
> or rdb-mirroring:
>
> http://docs.ceph.com/docs/master/rbd/rbd-mirroring/
>
> Regards,
> Tobias
>
> Am 24.07.19 um 13:37 schrieb Fabian Niepelt:
> > Hello ceph-users,
> >
> > I am currently building a Ceph cluster that will serve as a backend for
> > Openstack and object storage using RGW. The cluster itself is finished
> and
> > integrated with Openstack and virtual machines for testing are being
> deployed.
> > Now I'm a bit stumped on how to effectively backup the Ceph pools.
> > My requirements are two weekly backups, of which one must be offline
> after
> > finishing backing up (systems turned powerless). We are expecting about
> 250TB to
> > 500TB of data for now. The backups must protect against accidental pool
> > deletion/corruption or widespread infection of a cryptovirus. In short:
> Complete
> > data loss in the production Ceph cluster.
> >
> > At the moment, I am facing two issues:
> >
> > 1. For the cinder pool, I looked into creating snapshots using the ceph
> CLI (so
> > they don't turn up in Openstack and cannot be accidentally deleted by
> users) and
> > exporting their diffs. But volumes with snapshots created this way
> cannot be
> > removed from Openstack. Does anyone have an idea how to do this better?
> > Alternatively, I could do a full export each week, but I am not sure if
> that
> > would be fast enough..
> >
> > 2. My search so far has only turned up backing up RBD pools, but how
> could I
> > backup the pools that are used for object storage?
> >
> > Of course, I'm also open to completely other ideas on how to backup Ceph
> and
> > would appreciate hearing how you people are doing your backups.
> >
> > Any help is much appreciated.
> >
> > Greetings
> > Fabian
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Tobias Gall
> Facharbeitsgruppe Datenkommunikation
> Universitätsrechenzentrum
>
> Technische Universität Chemnitz
> Straße der Nationen 62 | R. B302A
> 09111 Chemnitz
> Germany
>
> Tel:+49 371 531-33617
> Fax:+49 371 531-833617
>
> tobias.g...@hrz.tu-chemnitz.de
> www.tu-chemnitz.de
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

2019-07-24 Thread Janek Bevendorff
Update: I had to wipe my CephFS, because after I increased the beacon 
grace period on the last attempt, I couldn't get the MDSs to rejoin 
anymore at all without running out of memory on the machine. I tried 
wiping all sessions and the journal, but it didn't work. In the end all 
I achieved was that the daemons crashed right after starting with some 
assertion error. So now I have a fresh CephFS and will try to copy the 
data from scratch.



On 24.07.19 15:36, Feng Zhang wrote:

Does a ceph-fuse mount also have the same issue?

That's hard to say. I started with the kernel module and saw the same 
behaviour again. I got to 930k inodes after only two minutes and stopped 
there. Since then, this number has not gone back down, not even after I 
disconnected all clients. I retried the same with ceph-fuse and the 
number did not increase any further (although it did not decrease 
either). When I unmounted the share and remounted it with the kernel 
module again, the number rose to 948k almost immediately.


So it looks like the problem only occurs with the kernel module, but 
maybe ceph-fuse is just too slow to tell. In fact, it is an order of magnitude 
slower: I only get 1.3k reqs/s compared to the 20k reqs/s with the kernel 
module, which is not practical at all.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions regarding backing up Ceph

2019-07-24 Thread Tobias Gall

Hallo,

what about RGW Replication:

https://ceph.com/geen-categorie/radosgw-simple-replication-example/
http://docs.ceph.com/docs/master/radosgw/multisite/

or rdb-mirroring:

http://docs.ceph.com/docs/master/rbd/rbd-mirroring/

Regards,
Tobias

Am 24.07.19 um 13:37 schrieb Fabian Niepelt:

Hello ceph-users,

I am currently building a Ceph cluster that will serve as a backend for
Openstack and object storage using RGW. The cluster itself is finished and
integrated with Openstack and virtual machines for testing are being deployed.
Now I'm a bit stumped on how to effectively backup the Ceph pools.
My requirements are two weekly backups, of which one must be offline after
finishing backing up (systems turned powerless). We are expecting about 250TB to
500TB of data for now. The backups must protect against accidental pool
deletion/corruption or widespread infection of a cryptovirus. In short: Complete
data loss in the production Ceph cluster.

At the moment, I am facing two issues:

1. For the cinder pool, I looked into creating snapshots using the ceph CLI (so
they don't turn up in Openstack and cannot be accidentally deleted by users) and
exporting their diffs. But volumes with snapshots created this way cannot be
removed from Openstack. Does anyone have an idea how to do this better?
Alternatively, I could do a full export each week, but I am not sure if that
would be fast enough..

2. My search so far has only turned up backing up RBD pools, but how could I
backup the pools that are used for object storage?

Of course, I'm also open to completely other ideas on how to backup Ceph and
would appreciate hearing how you people are doing your backups.

Any help is much appreciated.

Greetings
Fabian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Tobias Gall
Facharbeitsgruppe Datenkommunikation
Universitätsrechenzentrum

Technische Universität Chemnitz
Straße der Nationen 62 | R. B302A
09111 Chemnitz
Germany

Tel:+49 371 531-33617
Fax:+49 371 531-833617

tobias.g...@hrz.tu-chemnitz.de
www.tu-chemnitz.de



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-24 Thread vitalif

We're seeing ~5800 IOPS, ~23 MiB/s on 4 KiB IO (stripe_width 8192) on a
pool that could do 3 GiB/s with 4M blocksize. So, yeah, well, that is
rather harsh, even for EC.


4 KiB IO is slow in Ceph even without EC. Your 3 GB/s of linear writes don't 
mean anything here; Ceph adds a significant overhead to each operation.


From my observations, 4 KiB random write throughput with iodepth=128 in an 
all-flash cluster is only ~30% lower with EC 2+1 than with 3 replicas.
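
In case it helps, that kind of number comes from a test along these lines with
fio's rbd engine (pool, image and client name are placeholders):

$ fio --ioengine=rbd --clientname=admin --pool=rbdbench --rbdname=bench-img \
      --rw=randwrite --bs=4k --iodepth=128 --time_based --runtime=120 \
      --name=4k-randwrite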


With iodepth=1 and in an HDD+SSD setup it's worse: I get 100-120 write 
IOPS with EC and 500+ write IOPS with 3 replicas. I guess this is 
because in a replicated pool Ceph can just put the new block into the 
deferred write queue, while with EC it must first read the corresponding 
block from another OSD, and the SSD journal doesn't help reads. But I don't 
remember the exact test results for iodepth=128...


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions regarding backing up Ceph

2019-07-24 Thread Fabian Niepelt
Hi, thanks for the reply.

Am Mittwoch, den 24.07.2019, 15:26 +0200 schrieb Wido den Hollander:
> 
> On 7/24/19 1:37 PM, Fabian Niepelt wrote:
> > Hello ceph-users,
> > 
> > I am currently building a Ceph cluster that will serve as a backend for
> > Openstack and object storage using RGW. The cluster itself is finished and
> > integrated with Openstack and virtual machines for testing are being
> > deployed.
> > Now I'm a bit stumped on how to effectively backup the Ceph pools.
> > My requirements are two weekly backups, of which one must be offline after
> > finishing backing up (systems turned powerless). We are expecting about
> > 250TB to
> > 500TB of data for now. The backups must protect against accidental pool
> > deletion/corruption or widespread infection of a cryptovirus. In short:
> > Complete
> > data loss in the production Ceph cluster.
> > 
> > At the moment, I am facing two issues:
> > 
> > 1. For the cinder pool, I looked into creating snapshots using the ceph CLI
> > (so
> > they don't turn up in Openstack and cannot be accidentally deleted by users)
> > and
> > exporting their diffs. But volumes with snapshots created this way cannot be
> > removed from Openstack. Does anyone have an idea how to do this better?
> 
> You mean that while you leave the snapshot there OpenStack can't remove it?

Yes, that is correct. cinder-volume cannot remove a volume that still has a
snapshot. If the snapshot is created by openstack, it will remove the snapshot
before removing the volume. But snapshotting directly from ceph will forego
Openstack so it will never know about that snapshot's existence.

> > Alternatively, I could do a full export each week, but I am not sure if that
> > would be fast enough..
> > 
> 
> It probably won't, but the full backup is still the safest way imho.
> However: Does this scale?
> 
> You can export multiple RBD images in parallel and store them somewhere
> else, but it will still take a long time.
> 
> The export needs to be stored somewhere and then picked up. Or you could
> use some magic with Netcat to stream the RBD export to a destination host.
> 

Scaling is also my biggest worry about this.

> > 2. My search so far has only turned up backing up RBD pools, but how could I
> > backup the pools that are used for object storage?
> > 
> 
> Not easily. I think you mean RGW? You could try the RGW MultiSite, but
> it's difficult.
> 
> A complete DR with Ceph to restore it back to how it was at a given
> point in time is a challenge.
> 

Yes, I would like to backup the pools used by the RGW.

> > Of course, I'm also open to completely other ideas on how to backup Ceph and
> > would appreciate hearing how you people are doing your backups.
> 
> A lot of time the backups are created inside the VMs on File level. And
> there is a second OpenStack+Ceph system which runs a mirror of the VMs
> or application. If one burns down it's not the end of the world.
> 
> Trying to backup a Ceph cluster sounds very 'enterprise' and is
> difficult to scale as well.
> 

Are those backups saved in Ceph as well? I cannot solely rely on Ceph because we
want to protect ourselves against failures in Ceph or a human accidentally or
maliciously deletes all pools.

From what I'm reading, it seems to be better to maybe implement a backup
solution outside of Ceph that our Openstack users can use and not deal with
backing up Ceph at all, except its configs to get it running after total
desaster...

> Wido
> 
> > Any help is much appreciated.
> > 
> > Greetings
> > Fabian
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions regarding backing up Ceph

2019-07-24 Thread Marc Roos
 

>
> A complete DR with Ceph to restore it back to how it was at a given 
> point in time is a challenge.

>
> Trying to backup a Ceph cluster sounds very 'enterprise' and is 
> difficult to scale as well.

Hmmm, I was actually also curious how backups are done, especially on 
these clusters that have 300, 600 or even more OSDs. 

Is it common to do backups 'within' the Ceph cluster, e.g. with snapshots? 
A different pool? If this is common with most Ceph clusters, would that 
not also increase the security requirements for Ceph development? 

Recently I read that the cloud hosting provider Insynq lost all their data [0]. 
One remote exploit in Ceph, and TBs/PBs of data could be at risk. 



[0]
https://www.insynq.com/support/#status


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

2019-07-24 Thread Feng Zhang
Does a ceph-fuse mount also have the same issue?

On Wed, Jul 24, 2019 at 3:35 AM Janek Bevendorff
 wrote:
>
>
> I mean kernel version
>
> Oh, of course. 4.15.0-54 on Ubuntu 18.04 LTS.
>
> Right now I am also experiencing a different phenomenon. Since I wrapped it 
> up yesterday, the MDS machines have been trying to rejoin, but could only 
> handle a few hundred up to a few hundred thousand inodes per second before 
> crashing.
>
> I had a look at the machines and the daemons had trouble allocating memory. 
> There weren't many processes running and none of them consumed more than 5GB, 
> yet all 128 GB were used (and not freeable, so it wasn't just the page 
> cache). Thus I suppose there must also be a memory leak somewhere. No running 
> process had this much memory allocated, so it must have been allocated from 
> kernel space. I am rebooting the machines right now as a last resort.
>
>
>
>>
>> try mounting cephfs on a machine/vm with small memory (4G~8G), then rsync 
>> your date into mount point of that machine.
>>
>> I could try running it in a memory-limited Docker container, but isn't there 
>> a better way to achieve the same thing? This sounds like a bug to me. A 
>> client having too much memory and failing to free its capabilities shouldn't 
>> crash the server. If the server decides to drop things from its cache, the 
>> client has to deal with it.
>>
>> Also in the long run, limiting the client's memory isn't a practical 
>> solution. We are planning to use the CephFS from our compute cluster, whose 
>> nodes have (and need) many more times the RAM that our storage servers have.
>>
>>
>>
>>>
>>>
>>>
>>>
>>> The MDS nodes have  Xeon E5-2620 v4 CPUs @2.10GHz with 32 threads (Dual
>>> CPU with 8 physical cores each) and 128GB RAM. CPU usage is rather mild.
>>> While MDSs are trying to rejoin, they tend to saturate a single thread
>>> shortly, but nothing spectacular. During normal operation, none of the
>>> cores is particularly under load.
>>>
>>> > While migrating to a Nautilus cluster recently, we had up to 14
>>> > million inodes open, and we increased the cache limit to 16GiB. Other
>>> > than warnings about oversized cache, this caused no issues.
>>>
>>> I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than getting
>>> rid of the cache size warnings (and sometimes allowing an MDS to rejoin
>>> without being kicked again after a few seconds), it did not change much
>>> in terms of the actual problem. Right now I can change it to whatever I
>>> want, it doesn't do anything, because rank 0 keeps being trashed anyway
>>> (the other ranks are fine, but the CephFS is down anyway). Is there
>>> anything useful I can give you to debug this? Otherwise I would try
>>> killing the MDS daemons so I can at least restore the CephFS to a
>>> semi-operational state.
>>>
>>>
>>> >
>>> > On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
>>> >> Hi,
>>> >>
>>> >> Disclaimer: I posted this before to the cheph.io mailing list, but from
>>> >> the answers I didn't get and a look at the archives, I concluded that
>>> >> that list is very dead. So apologies if anyone has read this before.
>>> >>
>>> >> I am trying to copy our storage server to a CephFS. We have 5 MONs in
>>> >> our cluster and (now) 7 MDS with max_mds = 4. The list (!) of files I am
>>> >> trying to copy is about 23GB, so it's a lot of files. I am copying them
>>> >> in batches of 25k using 16 parallel rsync processes over a 10G link.
>>> >>
>>> >> I started out with 5 MDSs / 2 active, but had repeated issues with
>>> >> immense and growing cache sizes far beyond the theoretical maximum of
>>> >> 400k inodes which the 16 rsync processes could keep open at the same
>>> >> time. The usual inode count was between 1 and 4 million and the cache
>>> >> size between 20 and 80GB on average.
>>> >>
>>> >> After a while, the MDSs started failing under this load by either
>>> >> crashing or being kicked from the quorum. I tried increasing the max
>>> >> cache size, max log segments, and beacon grace period, but to no avail.
>>> >> A crashed MDS often needs minutes to rejoin.
>>> >>
>>> >> The MDSs fail with the following message:
>>> >>
>>> >>-21> 2019-07-22 14:00:05.877 7f67eacec700  1 heartbeat_map is_healthy
>>> >> 'MDSRank' had timed out after 15
>>> >>-20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping
>>> >> beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
>>> >> heartbeat is not healthy!
>>> >>
>>> >> I found the following thread, which seems to be about the same general
>>> >> issue:
>>> >>
>>> >> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
>>> >>
>>> >> Unfortunately, it does not really contain a solution except things I
>>> >> have tried already. Though it does give some explanation as to why the
>>> >> MDSs pile up so many open inodes. It appears like Ceph can't handle many
>>> >> (write-only) operations on different files, since the clients keep their

Re: [ceph-users] Nautilus dashboard: crushmap viewer shows only first root

2019-07-24 Thread Eugen Block

Thank you very much!


Zitat von EDH - Manuel Rios Fernandez :


Hi Eugen,

Yes, it's solved; we reported it in 14.2.1 and the team fixed it in 14.2.2

Regards,
Manuel

-Mensaje original-
De: ceph-users  En nombre de Eugen Block
Enviado el: miércoles, 24 de julio de 2019 15:10
Para: ceph-users@lists.ceph.com
Asunto: [ceph-users] Nautilus dashboard: crushmap viewer shows only first
root

Hi all,

we just upgraded our cluster to:

ceph version 14.2.0-300-gacd2f2b9e1
(acd2f2b9e196222b0350b3b59af9981f91706c7f) nautilus (stable)

When clicking through the dashboard to see what's new we noticed that the
crushmap viewer only shows the first root of our crushmap (we have two
roots). I couldn't find anything in the tracker, and I can't update further
to the latest release 14.2.2 to see if that has been resolved. Is this known
or already fixed?

Regards
Eugen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Questions regarding backing up Ceph

2019-07-24 Thread Wido den Hollander



On 7/24/19 1:37 PM, Fabian Niepelt wrote:
> Hello ceph-users,
> 
> I am currently building a Ceph cluster that will serve as a backend for
> Openstack and object storage using RGW. The cluster itself is finished and
> integrated with Openstack and virtual machines for testing are being deployed.
> Now I'm a bit stumped on how to effectively backup the Ceph pools.
> My requirements are two weekly backups, of which one must be offline after
> finishing backing up (systems turned powerless). We are expecting about 250TB 
> to
> 500TB of data for now. The backups must protect against accidental pool
> deletion/corruption or widespread infection of a cryptovirus. In short: 
> Complete
> data loss in the production Ceph cluster.
> 
> At the moment, I am facing two issues:
> 
> 1. For the cinder pool, I looked into creating snapshots using the ceph CLI 
> (so
> they don't turn up in Openstack and cannot be accidentally deleted by users) 
> and
> exporting their diffs. But volumes with snapshots created this way cannot be
> removed from Openstack. Does anyone have an idea how to do this better?

You mean that while you leave the snapshot there OpenStack can't remove it?

> Alternatively, I could do a full export each week, but I am not sure if that
> would be fast enough..
> 

It probably won't, but the full backup is still the safest way imho.
However: Does this scale?

You can export multiple RBD images in parallel and store them somewhere
else, but it will still take a long time.

The export needs to be stored somewhere and then picked up. Or you could
use some magic with Netcat to stream the RBD export to a destination host.
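
For example, something along these lines; image name, destination host, paths
and the netcat port are placeholders, and netcat flags vary by implementation:

# stream a full export over SSH
$ rbd export --no-progress volumes/volume-1234 - | ssh backup01 'cat > /backup/volume-1234.img'

# or with netcat: on the backup host
$ nc -l 9000 > /backup/volume-1234.img
# and on a host with access to the cluster
$ rbd export --no-progress volumes/volume-1234 - | nc backup01 9000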

> 2. My search so far has only turned up backing up RBD pools, but how could I
> backup the pools that are used for object storage?
> 

Not easily. I think you mean RGW? You could try the RGW MultiSite, but
it's difficult.

A complete DR with Ceph to restore it back to how it was at a given
point in time is a challenge.

> Of course, I'm also open to completely other ideas on how to backup Ceph and
> would appreciate hearing how you people are doing your backups.

A lot of the time, backups are created inside the VMs at the file level. And
there is a second OpenStack+Ceph system which runs a mirror of the VMs
or applications. If one burns down it's not the end of the world.

Trying to backup a Ceph cluster sounds very 'enterprise' and is
difficult to scale as well.

Wido

> 
> Any help is much appreciated.
> 
> Greetings
> Fabian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus dashboard: crushmap viewer shows only first root

2019-07-24 Thread EDH - Manuel Rios Fernandez
Hi Eugen,

Yes, it's solved; we reported it in 14.2.1 and the team fixed it in 14.2.2

Regards,
Manuel

-Mensaje original-
De: ceph-users  En nombre de Eugen Block
Enviado el: miércoles, 24 de julio de 2019 15:10
Para: ceph-users@lists.ceph.com
Asunto: [ceph-users] Nautilus dashboard: crushmap viewer shows only first
root

Hi all,

we just upgraded our cluster to:

ceph version 14.2.0-300-gacd2f2b9e1
(acd2f2b9e196222b0350b3b59af9981f91706c7f) nautilus (stable)

When clicking through the dashboard to see what's new we noticed that the
crushmap viewer only shows the first root of our crushmap (we have two
roots). I couldn't find anything in the tracker, and I can't update further
to the latest release 14.2.2 to see if that has been resolved. Is this known
or already fixed?

Regards
Eugen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Nautilus dashboard: crushmap viewer shows only first root

2019-07-24 Thread Eugen Block

Hi all,

we just upgraded our cluster to:

ceph version 14.2.0-300-gacd2f2b9e1  
(acd2f2b9e196222b0350b3b59af9981f91706c7f) nautilus (stable)


When clicking through the dashboard to see what's new we noticed that  
the crushmap viewer only shows the first root of our crushmap (we have  
two roots). I couldn't find anything in the tracker, and I can't  
update further to the latest release 14.2.2 to see if that has been  
resolved. Is this known or already fixed?


Regards
Eugen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Questions regarding backing up Ceph

2019-07-24 Thread Fabian Niepelt
Hello ceph-users,

I am currently building a Ceph cluster that will serve as a backend for
Openstack and object storage using RGW. The cluster itself is finished and
integrated with Openstack and virtual machines for testing are being deployed.
Now I'm a bit stumped on how to effectively backup the Ceph pools.
My requirements are two weekly backups, of which one must be offline after
finishing backing up (systems turned powerless). We are expecting about 250TB to
500TB of data for now. The backups must protect against accidental pool
deletion/corruption or widespread infection of a cryptovirus. In short: Complete
data loss in the production Ceph cluster.

At the moment, I am facing two issues:

1. For the cinder pool, I looked into creating snapshots using the ceph CLI (so
they don't turn up in Openstack and cannot be accidentally deleted by users) and
exporting their diffs. But volumes with snapshots created this way cannot be
removed from Openstack. Does anyone have an idea how to do this better?
Alternatively, I could do a full export each week, but I am not sure if that
would be fast enough..
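
The cycle I have been looking at is roughly this (pool, volume and snapshot
names are placeholders):

$ rbd snap create volumes/volume-1234@bkp-2019-07-24
$ rbd export-diff --from-snap bkp-2019-07-17 volumes/volume-1234@bkp-2019-07-24 /backup/volume-1234_2019-07-24.diff
$ rbd snap rm volumes/volume-1234@bkp-2019-07-17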

2. My search so far has only turned up backing up RBD pools, but how could I
backup the pools that are used for object storage?

Of course, I'm also open to completely other ideas on how to backup Ceph and
would appreciate hearing how you people are doing your backups.

Any help is much appreciated.

Greetings
Fabian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] The num of bandwidth while ceph recovering stands for?

2019-07-24 Thread 展荣臻(信泰)
Hello all,
  I am a bit confused about the bandwidth number shown while Ceph is recovering.
Below is the output of ceph -w while the cluster was recovering:


2019-07-22 18:30:20.378134 mon.0 [INF] pgmap v54047611: 704 pgs: 9 
active+remapped+backfilling, 695 active+clean; 3847 GB data, 7742 GB used, 5049 
GB / 12792 GB avail; 3928 B/s rd, 2918 kB/s wr, 682 op/s; 67455/2733834 objects 
misplaced (2.467%); 159 MB/s, 71 objects/s recovering
2019-07-22 18:30:21.441233 mon.0 [INF] pgmap v54047612: 704 pgs: 9 
active+remapped+backfilling, 695 active+clean; 3847 GB data, 7742 GB used, 5049 
GB / 12792 GB avail; 7818 B/s rd, 4111 kB/s wr, 1042 op/s; 67370/2733834 
objects misplaced (2.464%); 272 MB/s, 105 objects/s recovering
2019-07-22 18:30:22.501474 mon.0 [INF] pgmap v54047613: 704 pgs: 9 
active+remapped+backfilling, 695 active+clean; 3847 GB data, 7742 GB used, 5049 
GB / 12792 GB avail; 5935 B/s rd, 5733 kB/s wr, 1450 op/s; 67370/2733834 
objects misplaced (2.464%); 115 MB/s, 35 objects/s recovering
2019-07-22 18:30:23.514004 mon.0 [INF] pgmap v54047614: 704 pgs: 9 
active+remapped+backfilling, 695 active+clean; 3847 GB data, 7742 GB used, 5049 
GB / 12792 GB avail; 1940 B/s rd, 5374 kB/s wr, 1213 op/s; 67370/2733834 
objects misplaced (2.464%)
2019-07-22 18:30:24.575107 mon.0 [INF] pgmap v54047615: 704 pgs: 9 
active+remapped+backfilling, 695 active+clean; 3847 GB data, 7742 GB used, 5049 
GB / 12792 GB avail; 1964 B/s rd, 2921 kB/s wr, 639 op/s; 67360/2733834 objects 
misplaced (2.464%); 5892 kB/s, 4 objects/s recovering
2019-07-22 18:30:25.627126 mon.0 [INF] pgmap v54047616: 704 pgs: 9 
active+remapped+backfilling, 695 active+clean; 3847 GB data, 7742 GB used, 5049 
GB / 12792 GB avail; 3949 B/s rd, 3816 kB/s wr, 810 op/s; 67280/2733834 objects 
misplaced (2.461%); 121 MB/s, 44 objects/s recovering
2019-07-22 18:30:26.633295 mon.0 [INF] pgmap v54047617: 704 pgs: 9 
active+remapped+backfilling, 695 active+clean; 3847 GB data, 7742 GB used, 5049 
GB / 12792 GB avail; 11699 B/s rd, 4627 kB/s wr, 744 op/s; 67230/2733834 
objects misplaced (2.459%); 207 MB/s, 70 objects/s recovering


From the ceph -w output I calculated that the average recovery bandwidth is 152 MB/s 
and the total amount recovered is 143 GB, while the amount calculated by another 
method is 94 GB. I want to know what the bandwidth number in the recovering field 
stands for. Does it include both reads and writes? My Ceph version is Hammer.
Thanks!!!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to debug slow requests

2019-07-24 Thread Massimo Sgaravatto
Just so I understand: the duration for this operation is 329 seconds (a lot!),
but all the reported events happened at roughly the same time (2019-07-20
23:13:18).
Were all the events of this op reported?

Why do you see a problem with the "waiting for subops from 4" event?

Thanks, Massimo

On Wed, Jul 24, 2019 at 9:31 AM Wido den Hollander  wrote:

>
>
> On 7/20/19 6:06 PM, Wei Zhao wrote:
> > Hi ceph users:
> > I was doing a write benchmark and found that some IO gets blocked for a
> > very long time. The following log is one op; it seems to be waiting for
> > the replicas to finish. My Ceph version is 12.2.4, and the pool is 3+2 EC.
> > Could anyone give me some advice about how I should debug this next?
> >
> > {
> > "ops": [
> > {
> > "description": "osd_op(client.17985.0:670679 39.18
> >
> 39:1a63fc5c:::benchmark_data_SH-IDC1-10-5-37-174_2917453_object670678:head
> > [set-alloc-hint object_size 1048576 write_size 1048576,write
> > 0~1048576] snapc 0=[] ondisk+write+known_if_redirected e1135)",
> > "initiated_at": "2019-07-20 23:13:18.725466",
> > "age": 329.248875,
> > "duration": 329.248901,
> > "type_data": {
> > "flag_point": "waiting for sub ops",
> > "client_info": {
> > "client": "client.17985",
> > "client_addr": "10.5.137.174:0/1544466091",
> > "tid": 670679
> > },
> > "events": [
> > {
> > "time": "2019-07-20 23:13:18.725466",
> > "event": "initiated"
> > },
> > {
> > "time": "2019-07-20 23:13:18.726585",
> > "event": "queued_for_pg"
> > },
> > {
> > "time": "2019-07-20 23:13:18.726606",
> > "event": "reached_pg"
> > },
> > {
> > "time": "2019-07-20 23:13:18.726752",
> > "event": "started"
> > },
> > {
> > "time": "2019-07-20 23:13:18.726842",
> > "event": "waiting for subops from 4"
> > },
>
> This usually indicates there is something going on with osd.4
>
> I would go and see if osd.4 is very busy at that moment and check if the
> disk is 100% busy in iostat.
>
> It could be a number of things, but I would check that first.
>
> Wido
>
> > {
> > "time": "2019-07-20 23:13:18.743134",
> > "event": "op_commit"
> > },
> > {
> > "time": "2019-07-20 23:13:18.743137",
> > "event": "op_applied"
> > }
> > ]
> > }
> > },
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD replacement causes slow requests

2019-07-24 Thread Eugen Block

Hi Wido,

thanks for your response.


Have you tried to dump the historic slow ops on the OSDs involved to see
what is going on?
$ ceph daemon osd.X dump_historic_slow_ops


Good question, I don't recall doing that. Maybe my colleague did but  
he's on vacation right now. ;-)



But to be clear, are all the OSDs on Nautilus or is there a mix of L and
N OSDs?


I'll try to clarify: it was (and still is) a mixture of L and N OSDs,  
but all L-OSDs were empty at the time. The cluster already had  
rebalanced all PGs to the new OSDs. So the L-OSDs were not involved in  
this recovery process. We're currently upgrading the remaining servers  
to Nautilus, there's one left with L-OSDs, but those OSDs don't store  
any objects at the moment (different root in crushmap).


The recovery eventually finished successfully, but my colleague had to  
do it after business hours, maybe that's why he needs his vacation. ;-)
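In case it helps others, these are the knobs we plan to try next time to slow recovery down even further (a sketch; the sleep value is a guess and would need testing against our HDD-backed OSDs):

ceph config set osd osd_recovery_sleep_hdd 0.2
ceph config set osd osd_recovery_max_active 1
# osd_max_backfills is already at 1 in our case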


Regards,
Eugen


Zitat von Wido den Hollander :


On 7/18/19 12:21 PM, Eugen Block wrote:

Hi list,

we're facing an unexpected recovery behavior of an upgraded cluster
(Luminous -> Nautilus).

We added new servers with Nautilus to the existing Luminous cluster, so
we could first replace the MONs step by step. Then we moved the old
servers to a new root in the crush map and then added the new OSDs to
the default root so we would need to rebalance the data only once. This
almost worked as planned, except for many slow and stuck requests. We
did this after business hours so the impact was negligible, and we didn't
really investigate; the goal was to finish the rebalancing.

But only after two days one of the new OSDs (osd.30) already reported
errors, so we need to replace that disk.
The replacement disk (osd.0) has been added with an initial crush weight
of 0 (also reweight 0) to control the backfill with small steps.
This seems to be harder than it should be (and harder than we have experienced
so far): no matter how small the steps are, the cluster immediately reports
slow requests. We can't disrupt the production environment so we
cancelled the backfill/recovery for now. But this procedure has been
successful in the past with Luminous, that's why we're so surprised.

The recovery and backfill parameters are pretty low:

    "osd_max_backfills": "1",
    "osd_recovery_max_active": "3",

This usually allowed us a slow backfill to be able to continue
productive work, now it doesn't.

Our ceph version is (only the active MDS still runs Luminous, the
designated server is currently being upgraded):

14.2.0-300-gacd2f2b9e1 (acd2f2b9e196222b0350b3b59af9981f91706c7f)
nautilus (stable)

Is there anything we missed that we should be aware of in Nautilus
regarding recovery and replacement scenarios?
We couldn't reduce the weight of that osd lower than 0.16, everything
else results in slow requests.
During the weight reduction several PGs get stuck in the
activating+remapped state, only recoverable (sometimes) by
restarting the affected osd several times. Reducing the crush weight leads
to the same effect.

Please note: the old servers in root-ec are going to be ec-only OSDs,
that's why they're still in the cluster.

Any pointers to what goes wrong here would be highly appreciated! If you
need any other information I'd be happy to provide it.



Have you tried to dump the historic slow ops on the OSDs involved to see
what is going on?

$ ceph daemon osd.X dump_historic_slow_ops

But to be clear, are all the OSDs on Nautilus or is there a mix of L and
N OSDs?

Wido


Best regards,
Eugen


This is our osd tree:

ID  CLASS WEIGHT   TYPE NAME STATUS REWEIGHT PRI-AFF
-19   11.09143 root root-ec
 -2    5.54572 host ceph01
  1   hdd  0.92429 osd.1   down    0 1.0
  4   hdd  0.92429 osd.4 up    0 1.0
  6   hdd  0.92429 osd.6 up    0 1.0
 13   hdd  0.92429 osd.13    up    0 1.0
 16   hdd  0.92429 osd.16    up    0 1.0
 18   hdd  0.92429 osd.18    up    0 1.0
 -3    5.54572 host ceph02
  2   hdd  0.92429 osd.2 up    0 1.0
  5   hdd  0.92429 osd.5 up    0 1.0
  7   hdd  0.92429 osd.7 up    0 1.0
 12   hdd  0.92429 osd.12    up    0 1.0
 17   hdd  0.92429 osd.17    up    0 1.0
 19   hdd  0.92429 osd.19    up    0 1.0
 -5  0 host ceph03
 -1   38.32857 root default
-31   10.79997 host ceph04
 25   hdd  3.5 osd.25    up  1.0 1.0
 26   hdd  3.5 osd.26    up  1.0 1.0
 27   hdd  3.5 osd.27    up  1.0 1.0
-34   14.39995 host ceph05
  0   hdd  3.59998 osd.0 up    0 1.0
 28   hdd  3.5 osd.28    up  1.0 1.0
 29   hdd  3.5 osd.29   

Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

2019-07-24 Thread Janek Bevendorff



I mean kernel version


Oh, of course. 4.15.0-54 on Ubuntu 18.04 LTS.

Right now I am also experiencing a different phenomenon. Since I wrapped 
it up yesterday, the MDS machines have been trying to rejoin, but could 
only handle a few hundred up to a few hundred thousand inodes per second 
before crashing.


I had a look at the machines and the daemons had trouble allocating 
memory. There weren't many processes running and none of them consumed 
more than 5GB, yet all 128 GB were used (and not freeable, so it wasn't 
just the page cache). Thus I suppose there must also be a memory leak 
somewhere. No running process had this much memory allocated, so it must 
have been allocated from kernel space. I am rebooting the machines right 
now as a last resort.
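For what it's worth, these are the commands I used to look at the memory situation (a sketch; mds.X is a placeholder, and the first two commands go against the admin socket on the MDS host):

# MDS-side view of the cache and of per-pool allocations
ceph daemon mds.X cache status
ceph daemon mds.X dump_mempools
# host-side check whether the memory sits in kernel slabs rather than processes
grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo
slabtop -o | head -n 20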




try mounting cephfs on a machine/vm with small memory (4G~8G),
then rsync your data into the mount point of that machine.


I could try running it in a memory-limited Docker container, but
isn't there a better way to achieve the same thing? This sounds
like a bug to me. A client having too much memory and failing to
free its capabilities shouldn't crash the server. If the server
decides to drop things from its cache, the client has to deal with it.

Also in the long run, limiting the client's memory isn't a
practical solution. We are planning to use the CephFS from our
compute cluster, whose nodes have (and need) many more times the
RAM that our storage servers have.




The MDS nodes have  Xeon E5-2620 v4 CPUs @2.10GHz
with 32 threads (Dual
CPU with 8 physical cores each) and 128GB RAM. CPU
usage is rather mild.
While MDSs are trying to rejoin, they tend to
saturate a single thread
shortly, but nothing spectacular. During normal
operation, none of the
cores is particularly under load.

> While migrating to a Nautilus cluster recently, we
had up to 14
> million inodes open, and we increased the cache
limit to 16GiB. Other
> than warnings about oversized cache, this caused no
issues.

I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB.
Other than getting
rid of the cache size warnings (and sometimes
allowing an MDS to rejoin
without being kicked again after a few seconds), it
did not change much
in terms of the actual problem. Right now I can
change it to whatever I
want, it doesn't do anything, because rank 0 keeps
being trashed anyway
(the other ranks are fine, but the CephFS is down
anyway). Is there
anything useful I can give you to debug this?
Otherwise I would try
killing the MDS daemons so I can at least restore the
CephFS to a
semi-operational state.


>
> On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
>> Hi,
>>
>> Disclaimer: I posted this before to the cheph.io
 mailing list, but from
>> the answers I didn't get and a look at the
archives, I concluded that
>> that list is very dead. So apologies if anyone has
read this before.
>>
>> I am trying to copy our storage server to a
CephFS. We have 5 MONs in
>> our cluster and (now) 7 MDS with max_mds = 4. The
list (!) of files I am
>> trying to copy is about 23GB, so it's a lot of
files. I am copying them
>> in batches of 25k using 16 parallel rsync
processes over a 10G link.
>>
>> I started out with 5 MDSs / 2 active, but had
repeated issues with
>> immense and growing cache sizes far beyond the
theoretical maximum of
>> 400k inodes which the 16 rsync processes could
keep open at the same
>> time. The usual inode count was between 1 and 4
million and the cache
>> size between 20 and 80GB on average.
>>
>> After a while, the MDSs started failing under this
load by either
>> crashing or being kicked from the quorum. I tried
increasing the max
>> cache size, max log segments, and beacon grace
period, but to no avail.
>> A crashed MDS often needs minutes to rejoin.
>>
>> The MDSs fail with the following message:
 

Re: [ceph-users] OSD replacement causes slow requests

2019-07-24 Thread Wido den Hollander


On 7/18/19 12:21 PM, Eugen Block wrote:
> Hi list,
> 
> we're facing an unexpected recovery behavior of an upgraded cluster
> (Luminous -> Nautilus).
> 
> We added new servers with Nautilus to the existing Luminous cluster, so
> we could first replace the MONs step by step. Then we moved the old
> servers to a new root in the crush map and then added the new OSDs to
> the default root so we would need to rebalance the data only once. This
> almost worked as planned, except for many slow and stuck requests. We
> did this after business hours so the impact was negligible, and we didn't
> really investigate; the goal was to finish the rebalancing.
> 
> But only after two days one of the new OSDs (osd.30) already reported
> errors, so we need to replace that disk.
> The replacement disk (osd.0) has been added with an initial crush weight
> of 0 (also reweight 0) to control the backfill with small steps.
> This seems to be harder than it should be (and harder than we have experienced
> so far): no matter how small the steps are, the cluster immediately reports
> slow requests. We can't disrupt the production environment so we
> cancelled the backfill/recovery for now. But this procedure has been
> successful in the past with Luminous, that's why we're so surprised.
> 
> The recovery and backfill parameters are pretty low:
> 
>     "osd_max_backfills": "1",
>     "osd_recovery_max_active": "3",
> 
> This usually allowed us a slow backfill to be able to continue
> productive work, now it doesn't.
> 
> Our ceph version is (only the active MDS still runs Luminous, the
> designated server is currently being upgraded):
> 
> 14.2.0-300-gacd2f2b9e1 (acd2f2b9e196222b0350b3b59af9981f91706c7f)
> nautilus (stable)
> 
> Is there anything we missed that we should be aware of in Nautilus
> regarding recovery and replacement scenarios?
> We couldn't reduce the weight of that osd lower than 0.16, everything
> else results in slow requests.
> During the weight reduction several PGs get stuck in the
> activating+remapped state, only recoverable (sometimes) by
> restarting the affected osd several times. Reducing the crush weight leads
> to the same effect.
> 
> Please note: the old servers in root-ec are going to be ec-only OSDs,
> that's why they're still in the cluster.
> 
> Any pointers to what goes wrong here would be highly appreciated! If you
> need any other information I'd be happy to provide it.
> 

Have you tried to dump the historic slow ops on the OSDs involved to see
what is going on?

$ ceph daemon osd.X dump_historic_slow_ops
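For example, something along these lines on the hosts carrying the involved OSDs (a sketch; the OSD ids are taken from your tree and the output path is arbitrary):

for id in 28 29 30; do
  ceph daemon osd.$id dump_historic_slow_ops > /tmp/osd.$id.slow_ops.json
done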

But to be clear, are all the OSDs on Nautilus or is there a mix of L and
N OSDs?

Wido

> Best regards,
> Eugen
> 
> 
> This is our osd tree:
> 
> ID  CLASS WEIGHT   TYPE NAME STATUS REWEIGHT PRI-AFF
> -19   11.09143 root root-ec
>  -2    5.54572 host ceph01
>   1   hdd  0.92429 osd.1   down    0 1.0
>   4   hdd  0.92429 osd.4 up    0 1.0
>   6   hdd  0.92429 osd.6 up    0 1.0
>  13   hdd  0.92429 osd.13    up    0 1.0
>  16   hdd  0.92429 osd.16    up    0 1.0
>  18   hdd  0.92429 osd.18    up    0 1.0
>  -3    5.54572 host ceph02
>   2   hdd  0.92429 osd.2 up    0 1.0
>   5   hdd  0.92429 osd.5 up    0 1.0
>   7   hdd  0.92429 osd.7 up    0 1.0
>  12   hdd  0.92429 osd.12    up    0 1.0
>  17   hdd  0.92429 osd.17    up    0 1.0
>  19   hdd  0.92429 osd.19    up    0 1.0
>  -5  0 host ceph03
>  -1   38.32857 root default
> -31   10.79997 host ceph04
>  25   hdd  3.5 osd.25    up  1.0 1.0
>  26   hdd  3.5 osd.26    up  1.0 1.0
>  27   hdd  3.5 osd.27    up  1.0 1.0
> -34   14.39995 host ceph05
>   0   hdd  3.59998 osd.0 up    0 1.0
>  28   hdd  3.5 osd.28    up  1.0 1.0
>  29   hdd  3.5 osd.29    up  1.0 1.0
>  30   hdd  3.5 osd.30    up  0.15999   0
> -37   10.79997 host ceph06
>  31   hdd  3.5 osd.31    up  1.0 1.0
>  32   hdd  3.5 osd.32    up  1.0 1.0
>  33   hdd  3.5 osd.33    up  1.0 1.0
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pools limit

2019-07-24 Thread Wido den Hollander


On 7/16/19 6:53 PM, M Ranga Swami Reddy wrote:
> Thanks for your reply.
> Here, new pool creations and PG autoscaling may cause rebalancing, which
> impacts the ceph cluster performance.
> 
> Please share namespace details, e.g. how to use them.
> 

Would it be RBD, Rados, CephFS? What would you be using on top of Ceph?
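If it ends up being RBD, per-project separation with namespaces inside a single pool might look roughly like this (a sketch, not a recommendation; pool and project names are placeholders):

rbd namespace create rbd/project-a
ceph auth get-or-create client.project-a \
  mon 'profile rbd' \
  osd 'profile rbd pool=rbd namespace=project-a'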

Wido

> 
> 
> On Tue, 16 Jul, 2019, 9:30 PM Paul Emmerich,  > wrote:
> 
> 100+ pools work fine if you can get the PG count right (auto-scaler
> helps, there are some options that you'll need to tune for small-ish
> pools).
> 
> But it's not a "nice" setup. Have you considered using namespaces
> instead?
> 
> 
> Paul
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io 
> Tel: +49 89 1896585 90
> 
> 
> On Tue, Jul 16, 2019 at 4:17 PM M Ranga Swami Reddy
> mailto:swamire...@gmail.com>> wrote:
> 
> Hello - I have created 10 nodes ceph cluster with 14.x version.
> Can you please confirm below:
> 
> Q1 - Can I create 100+ pools (or more) on the cluster? (The
> reason is creating a pool per project.) Is there any limitation on pool
> creation?
> 
> Q2 - In the above pools I would use 128 as the PG_NUM to start with and
> enable autoscaling for PG_NUM, so that based on the data in the
> pool, PG_NUM will be increased by ceph itself.
> 
> Let me know if there are any limitations to the above or any foreseeable issues.
> 
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to debug slow requests

2019-07-24 Thread Wido den Hollander



On 7/20/19 6:06 PM, Wei Zhao wrote:
> Hi ceph users:
> I was doing a write benchmark and found that some IO gets blocked for a
> very long time. The following log is one op; it seems to be waiting for a
> replica to finish. My ceph version is 12.2.4, and the pool is 3+2 EC.
> Does anyone have advice on how I should debug this next?
> 
> {
> "ops": [
> {
> "description": "osd_op(client.17985.0:670679 39.18
> 39:1a63fc5c:::benchmark_data_SH-IDC1-10-5-37-174_2917453_object670678:head
> [set-alloc-hint object_size 1048576 write_size 1048576,write
> 0~1048576] snapc 0=[] ondisk+write+known_if_redirected e1135)",
> "initiated_at": "2019-07-20 23:13:18.725466",
> "age": 329.248875,
> "duration": 329.248901,
> "type_data": {
> "flag_point": "waiting for sub ops",
> "client_info": {
> "client": "client.17985",
> "client_addr": "10.5.137.174:0/1544466091",
> "tid": 670679
> },
> "events": [
> {
> "time": "2019-07-20 23:13:18.725466",
> "event": "initiated"
> },
> {
> "time": "2019-07-20 23:13:18.726585",
> "event": "queued_for_pg"
> },
> {
> "time": "2019-07-20 23:13:18.726606",
> "event": "reached_pg"
> },
> {
> "time": "2019-07-20 23:13:18.726752",
> "event": "started"
> },
> {
> "time": "2019-07-20 23:13:18.726842",
> "event": "waiting for subops from 4"
> },

This usually indicates there is something going on with osd.4

I would go and see if osd.4 is very busy at that moment and check if the
disk is 100% busy in iostat.

It could be a number of things, but I would check that first.
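A quick way to check could look like this (a sketch; which data device belongs to osd.4 and its host are assumptions, take them from the metadata output):

ceph osd metadata 4 | grep -E '"hostname"|"devices"'
# then, on that host, watch the device for a while
iostat -x 1 /dev/sdX
# and compare with the op latencies the OSD itself reports
ceph daemon osd.4 perf dump | grep -A 3 op_latency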

Wido

> {
> "time": "2019-07-20 23:13:18.743134",
> "event": "op_commit"
> },
> {
> "time": "2019-07-20 23:13:18.743137",
> "event": "op_applied"
> }
> ]
> }
> },
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

2019-07-24 Thread Yan, Zheng
On Wed, Jul 24, 2019 at 3:13 PM Janek Bevendorff <
janek.bevendo...@uni-weimar.de> wrote:

>
> which version?
>
> Nautilus, 14.2.2.
>

I mean kernel version


> try mounting cephfs on a machine/vm with small memory (4G~8G), then rsync
> your data into the mount point of that machine.
>
> I could try running it in a memory-limited Docker container, but isn't
> there a better way to achieve the same thing? This sounds like a bug to me.
> A client having too much memory and failing to free its capabilities
> shouldn't crash the server. If the server decides to drop things from its
> cache, the client has to deal with it.
>
> Also in the long run, limiting the client's memory isn't a practical
> solution. We are planning to use the CephFS from our compute cluster, whose
> nodes have (and need) many more times the RAM that our storage servers have.
>
>
>
>
>>
>>
>>
>> The MDS nodes have  Xeon E5-2620 v4 CPUs @2.10GHz with 32 threads (Dual
>> CPU with 8 physical cores each) and 128GB RAM. CPU usage is rather mild.
>> While MDSs are trying to rejoin, they tend to saturate a single thread
>> shortly, but nothing spectacular. During normal operation, none of the
>> cores is particularly under load.
>>
>> > While migrating to a Nautilus cluster recently, we had up to 14
>> > million inodes open, and we increased the cache limit to 16GiB. Other
>> > than warnings about oversized cache, this caused no issues.
>>
>> I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than getting
>> rid of the cache size warnings (and sometimes allowing an MDS to rejoin
>> without being kicked again after a few seconds), it did not change much
>> in terms of the actual problem. Right now I can change it to whatever I
>> want, it doesn't do anything, because rank 0 keeps being trashed anyway
>> (the other ranks are fine, but the CephFS is down anyway). Is there
>> anything useful I can give you to debug this? Otherwise I would try
>> killing the MDS daemons so I can at least restore the CephFS to a
>> semi-operational state.
>>
>>
>> >
>> > On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
>> >> Hi,
>> >>
>> >> Disclaimer: I posted this before to the cheph.io mailing list, but
>> from
>> >> the answers I didn't get and a look at the archives, I concluded that
>> >> that list is very dead. So apologies if anyone has read this before.
>> >>
>> >> I am trying to copy our storage server to a CephFS. We have 5 MONs in
>> >> our cluster and (now) 7 MDS with max_mds = 4. The list (!) of files I
>> am
>> >> trying to copy is about 23GB, so it's a lot of files. I am copying them
>> >> in batches of 25k using 16 parallel rsync processes over a 10G link.
>> >>
>> >> I started out with 5 MDSs / 2 active, but had repeated issues with
>> >> immense and growing cache sizes far beyond the theoretical maximum of
>> >> 400k inodes which the 16 rsync processes could keep open at the same
>> >> time. The usual inode count was between 1 and 4 million and the cache
>> >> size between 20 and 80GB on average.
>> >>
>> >> After a while, the MDSs started failing under this load by either
>> >> crashing or being kicked from the quorum. I tried increasing the max
>> >> cache size, max log segments, and beacon grace period, but to no avail.
>> >> A crashed MDS often needs minutes to rejoin.
>> >>
>> >> The MDSs fail with the following message:
>> >>
>> >>-21> 2019-07-22 14:00:05.877 7f67eacec700  1 heartbeat_map
>> is_healthy
>> >> 'MDSRank' had timed out after 15
>> >>-20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping
>> >> beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
>> >> heartbeat is not healthy!
>> >>
>> >> I found the following thread, which seems to be about the same general
>> >> issue:
>> >>
>> >>
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
>> >>
>> >> Unfortunately, it does not really contain a solution except things I
>> >> have tried already. Though it does give some explanation as to why the
>> >> MDSs pile up so many open inodes. It appears like Ceph can't handle
>> many
>> >> (write-only) operations on different files, since the clients keep
>> their
>> >> capabilities open and the MDS can't evict them from its cache. This is
>> >> very baffling to me, since how am I supposed to use a CephFS if I
>> cannot
>> >> fill it with files before?
>> >>
>> >> The next thing I tried was increasing the number of active MDSs. Three
>> >> seemed to make it worse, but four worked surprisingly well.
>> >> Unfortunately, the crash came eventually and the rank-0 MDS got kicked.
>> >> Since then the standbys have been (not very successfully) playing
>> >> round-robin to replace it, only to be kicked repeatedly. This is the
>> >> status quo right now and it has been going for hours with no end in
>> >> sight. The only option might be to kill all MDSs and let them restart
>> >> from empty caches.
>> >>
>> >> While trying to rejoin, the MDSs keep logging the above-mentioned 

Re: [ceph-users] Nautilus:14.2.2 Legacy BlueStore stats reporting detected

2019-07-24 Thread nokia ceph
Hi Team,

I guess for clusters installed with Nautilus this warning will not appear, and
it only affects upgraded systems.
Please let us know whether disabling the bluestore warning on legacy statfs is the only
option for upgraded clusters.
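For reference, the two approaches I am aware of (a sketch; I am not sure which one is preferred, and the repair assumes the OSD is stopped while it runs):

# just silence the warning cluster-wide
ceph config set global bluestore_warn_on_legacy_statfs false
# or convert an OSD's on-disk stats to the new format, one OSD at a time
systemctl stop ceph-osd@<id>
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>
systemctl start ceph-osd@<id>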

thanks,
Muthu

On Fri, Jul 19, 2019 at 5:22 PM Paul Emmerich 
wrote:

> bluestore warn on legacy statfs = false
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Fri, Jul 19, 2019 at 1:35 PM nokia ceph 
> wrote:
>
>> Hi Team,
>>
>> After upgrading our cluster from 14.2.1 to 14.2.2 , the cluster moved to
>> warning state with following error
>>
>> cn1.chn6m1c1ru1c1.cdn ~# ceph status
>>   cluster:
>> id: e9afb5f3-4acf-421a-8ae6-caaf328ef888
>> health: HEALTH_WARN
>> Legacy BlueStore stats reporting detected on 335 OSD(s)
>>
>>   services:
>> mon: 5 daemons, quorum cn1,cn2,cn3,cn4,cn5 (age 114m)
>> mgr: cn4(active, since 2h), standbys: cn3, cn1, cn2, cn5
>> osd: 335 osds: 335 up (since 112m), 335 in
>>
>>   data:
>> pools:   1 pools, 8192 pgs
>> objects: 129.01M objects, 849 TiB
>> usage:   1.1 PiB used, 749 TiB / 1.8 PiB avail
>> pgs: 8146 active+clean
>>  46   active+clean+scrubbing
>>
>> I checked the bug list and found that this issue is marked as solved, but it
>> still exists:
>>
>> https://github.com/ceph/ceph/pull/28563
>>
>> How to disable this warning?
>>
>> Thanks,
>> Muthu
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

2019-07-24 Thread Janek Bevendorff


which version?


Nautilus, 14.2.2.

try mounting cephfs on a machine/vm with small memory (4G~8G), then
rsync your data into the mount point of that machine.


I could try running it in a memory-limited Docker container, but isn't 
there a better way to achieve the same thing? This sounds like a bug to 
me. A client having too much memory and failing to free its capabilities 
shouldn't crash the server. If the server decides to drop things from 
its cache, the client has to deal with it.


Also in the long run, limiting the client's memory isn't a practical 
solution. We are planning to use the CephFS from our compute cluster, 
whose nodes have (and need) many more times the RAM that our storage 
servers have.





The MDS nodes have  Xeon E5-2620 v4 CPUs @2.10GHz with 32
threads (Dual
CPU with 8 physical cores each) and 128GB RAM. CPU usage
is rather mild.
While MDSs are trying to rejoin, they tend to saturate a
single thread
shortly, but nothing spectacular. During normal operation,
none of the
cores is particularly under load.

> While migrating to a Nautilus cluster recently, we had
up to 14
> million inodes open, and we increased the cache limit to
16GiB. Other
> than warnings about oversized cache, this caused no issues.

I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB.
Other than getting
rid of the cache size warnings (and sometimes allowing an
MDS to rejoin
without being kicked again after a few seconds), it did
not change much
in terms of the actual problem. Right now I can change it
to whatever I
want, it doesn't do anything, because rank 0 keeps being
trashed anyway
(the other ranks are fine, but the CephFS is down anyway).
Is there
anything useful I can give you to debug this? Otherwise I
would try
killing the MDS daemons so I can at least restore the
CephFS to a
semi-operational state.


>
> On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
>> Hi,
>>
>> Disclaimer: I posted this before to the cheph.io
 mailing list, but from
>> the answers I didn't get and a look at the archives, I
concluded that
>> that list is very dead. So apologies if anyone has read
this before.
>>
>> I am trying to copy our storage server to a CephFS. We
have 5 MONs in
>> our cluster and (now) 7 MDS with max_mds = 4. The list
(!) of files I am
>> trying to copy is about 23GB, so it's a lot of files. I
am copying them
>> in batches of 25k using 16 parallel rsync processes
over a 10G link.
>>
>> I started out with 5 MDSs / 2 active, but had repeated
issues with
>> immense and growing cache sizes far beyond the
theoretical maximum of
>> 400k inodes which the 16 rsync processes could keep
open at the same
>> time. The usual inode count was between 1 and 4 million
and the cache
>> size between 20 and 80GB on average.
>>
>> After a while, the MDSs started failing under this load
by either
>> crashing or being kicked from the quorum. I tried
increasing the max
>> cache size, max log segments, and beacon grace period,
but to no avail.
>> A crashed MDS often needs minutes to rejoin.
>>
>> The MDSs fail with the following message:
>>
>>    -21> 2019-07-22 14:00:05.877 7f67eacec700  1
heartbeat_map is_healthy
>> 'MDSRank' had timed out after 15
>>    -20> 2019-07-22 14:00:05.877 7f67eacec700  0
mds.beacon.XXX Skipping
>> beacon heartbeat to monitors (last acked 24.0042s ago);
MDS internal
>> heartbeat is not healthy!
>>
>> I found the following thread, which seems to be about
the same general
>> issue:
>>
>>

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
>>
>> Unfortunately, it does not really contain a solution
except things I
>> have tried already. Though it does give some
explanation as to why the
>> MDSs pile up so many open inodes. It appears like Ceph
can't handle many
>> (write-only) operations on different files, since the
clients keep their
  

Re: [ceph-users] MDS fails repeatedly while handling many concurrent meta data operations

2019-07-24 Thread Yan, Zheng
On Wed, Jul 24, 2019 at 1:58 PM Janek Bevendorff <
janek.bevendo...@uni-weimar.de> wrote:

> Ceph-fuse ?
>
> No, I am using the kernel module.
>
>
which version?


>
> Was there "Client xxx failing to respond to cache pressure" health warning?
>
>
> At first, yes (at least with the Mimic client). There were also warnings
> about being behind on trimming. I haven't seen these warnings with Nautilus
> now, but the effect is pretty much the same: boatloads of runaway inodes.
>
> I tried to find other discussions about these warnings here on the list or
> elsewhere in the Internet, but couldn't find anything useful except that it
> shouldn't happen.
>
>
>
try mounting cephfs on a machine/vm with small memory (4G~8G), then rsync
your data into the mount point of that machine.
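Something like this, for example (a sketch; the monitor address, credentials and paths are placeholders):

mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=rsyncuser,secretfile=/etc/ceph/rsyncuser.secret
rsync -a --files-from=batch-00001.txt /storage/ /mnt/cephfs/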


>
>
>
> The MDS nodes have  Xeon E5-2620 v4 CPUs @2.10GHz with 32 threads (Dual
> CPU with 8 physical cores each) and 128GB RAM. CPU usage is rather mild.
> While MDSs are trying to rejoin, they tend to saturate a single thread
> shortly, but nothing spectacular. During normal operation, none of the
> cores is particularly under load.
>
> > While migrating to a Nautilus cluster recently, we had up to 14
> > million inodes open, and we increased the cache limit to 16GiB. Other
> > than warnings about oversized cache, this caused no issues.
>
> I tried settings of 1, 2, 5, 6, 10, 20, 50, and 90GB. Other than getting
> rid of the cache size warnings (and sometimes allowing an MDS to rejoin
> without being kicked again after a few seconds), it did not change much
> in terms of the actual problem. Right now I can change it to whatever I
> want, it doesn't do anything, because rank 0 keeps being trashed anyway
> (the other ranks are fine, but the CephFS is down anyway). Is there
> anything useful I can give you to debug this? Otherwise I would try
> killing the MDS daemons so I can at least restore the CephFS to a
> semi-operational state.
>
>
> >
> > On Tue, Jul 23, 2019 at 2:30 PM Janek Bevendorff wrote:
> >> Hi,
> >>
> >> Disclaimer: I posted this before to the cheph.io mailing list, but from
> >> the answers I didn't get and a look at the archives, I concluded that
> >> that list is very dead. So apologies if anyone has read this before.
> >>
> >> I am trying to copy our storage server to a CephFS. We have 5 MONs in
> >> our cluster and (now) 7 MDS with max_mds = 4. The list (!) of files I am
> >> trying to copy is about 23GB, so it's a lot of files. I am copying them
> >> in batches of 25k using 16 parallel rsync processes over a 10G link.
> >>
> >> I started out with 5 MDSs / 2 active, but had repeated issues with
> >> immense and growing cache sizes far beyond the theoretical maximum of
> >> 400k inodes which the 16 rsync processes could keep open at the same
> >> time. The usual inode count was between 1 and 4 million and the cache
> >> size between 20 and 80GB on average.
> >>
> >> After a while, the MDSs started failing under this load by either
> >> crashing or being kicked from the quorum. I tried increasing the max
> >> cache size, max log segments, and beacon grace period, but to no avail.
> >> A crashed MDS often needs minutes to rejoin.
> >>
> >> The MDSs fail with the following message:
> >>
> >>-21> 2019-07-22 14:00:05.877 7f67eacec700  1 heartbeat_map is_healthy
> >> 'MDSRank' had timed out after 15
> >>-20> 2019-07-22 14:00:05.877 7f67eacec700  0 mds.beacon.XXX Skipping
> >> beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal
> >> heartbeat is not healthy!
> >>
> >> I found the following thread, which seems to be about the same general
> >> issue:
> >>
> >>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024944.html
> >>
> >> Unfortunately, it does not really contain a solution except things I
> >> have tried already. Though it does give some explanation as to why the
> >> MDSs pile up so many open inodes. It appears like Ceph can't handle many
> >> (write-only) operations on different files, since the clients keep their
> >> capabilities open and the MDS can't evict them from its cache. This is
> >> very baffling to me, since how am I supposed to use a CephFS if I cannot
> >> fill it with files before?
> >>
> >> The next thing I tried was increasing the number of active MDSs. Three
> >> seemed to make it worse, but four worked surprisingly well.
> >> Unfortunately, the crash came eventually and the rank-0 MDS got kicked.
> >> Since then the standbys have been (not very successfully) playing
> >> round-robin to replace it, only to be kicked repeatedly. This is the
> >> status quo right now and it has been going for hours with no end in
> >> sight. The only option might be to kill all MDSs and let them restart
> >> from empty caches.
> >>
> >> While trying to rejoin, the MDSs keep logging the above-mentioned error
> >> message followed by
> >>
> >> 2019-07-23 17:53:37.386 7f3b135a5700  0 mds.0.cache.ino(0x100019693f8)
> >> have open dirfrag * but not leaf in fragtree_t(*^3): [dir 

Re: [ceph-users] Please help: change IP address of a cluster

2019-07-24 Thread ST Wong (ITSC)
Hi,

Thanks for your help.
I changed the IP addresses of the OSD nodes successfully. Changing the IP address on a
MON works, except that the MON only listens on the v2 port 3300 after adding the
MON back to the cluster. Previously the MON listened on both v1 (6789) and v2
(3300).
Besides, I can't add both v1 and v2 entries manually using monmaptool like the
following; only the 2nd one gets added.

monmaptool -add node1  v1:10.0.1.1:6789, v2:10.0.1.1:3330  {tmp}/{filename}

The monmap now looks like the following:

min_mon_release 14 (nautilus)
0: [v2:10.0.1.92:3300/0,v1:10.0.1.92:6789/0] mon.cmon2
1: [v2:10.0.1.93:3300/0,v1:10.0.1.93:6789/0] mon.cmon2
2: [v2:10.0.1.94:3300/0,v1:10.0.1.94:6789/0] mon.cmon3
3: [v2:10.0.1.95:3300/0,v1:10.0.1.95:6789/0] mon.cmon4
4: v2:10.0.1.97:3300/0 mon.cmon5    <--- the MON being removed/added

Although it's okay to use v2 only, I'm afraid I've missed some steps and messed
the cluster up. Any advice?
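One thing I plan to try is adding the whole address vector in one go with --addv instead of two separate --add calls (an assumption on my part; also assuming the v2 port should be 3300 rather than 3330):

monmaptool --addv cmon5 '[v2:10.0.1.97:3300,v1:10.0.1.97:6789]' {tmp}/{filename}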
Thanks again.

Best Regards,
/stwong

-Original Message-
From: ceph-users  On Behalf Of Manuel Lausch
Sent: Tuesday, July 23, 2019 7:32 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Please help: change IP address of a cluster

Hi,

I had to change the IPs of my cluster some time ago. The process was quite easy.

I don't understand what you mean by configuring and deleting static routes.
The easiest way is if the router allows (at least for the duration of the
change) all traffic between the old and the new network.

I did the following steps.

1. Add the new IP network, space separated, to the "public network" line in your
ceph.conf (see the example after step 4).

2. OSDs: stop your OSDs on the first node, reconfigure the host network, and
start your OSDs again. Repeat this for all hosts, one by one.

3. MON: stop and remove one mon from the cluster, delete all data in
/var/ceph/mon/mon, and reconfigure the host network. Then create the new mon
instance (don't forget the "mon host" entries in your ceph.conf and on your clients
as well). Of course this requires at least 3 mons in your cluster!
After 2 of the 5 mons in my cluster were done, I added the new mon addresses to my
clients and restarted them.

4. MGR: stop the mgr daemon, reconfigure the host network, and start the mgr daemon
again. Do this one by one.
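For step 1, the ceph.conf entry looked roughly like this (a sketch with example networks):

[global]
public network = 10.0.7.0/24 10.0.18.0/23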


I wouldn't recommend the "messy way" to reconfigure your mons. Removing and
adding mons to the cluster is quite easy and, in my opinion, the safest approach.

The complete IP change in our cluster worked without an outage while the cluster
was in production.

I hope I could help you.

Regards
Manuel



On Fri, 19 Jul 2019 10:22:37 +
"ST Wong (ITSC)"  wrote:

> Hi all,
> 
> Our cluster has to change to new IP range in same VLAN:  10.0.7.0/24
> -> 10.0.18.0/23, while IP address on private network for OSDs
> remains unchanged. I wonder if we can do that in either one following
> ways:
> 
> =
> 
> 1.
> 
> a.   Define static route for 10.0.18.0/23 on each node
> 
> b.   Do it one by one:
> 
> For each monitor/mgr:
> 
> -  remove from cluster
> 
> -  change IP address
> 
> -  add static route to original IP range 10.0.7.0/24
> 
> -  delete static route for 10.0.18.0/23
> 
> -  add back to cluster
> 
> For each OSD:
> 
> -  stop OSD daemons
> 
> -   change IP address
> 
> -  add static route to original IP range 10.0.7.0/24
> 
> -  delete static route for 10.0.18.0/23
> 
> -  start OSD daemons
> 
> c.   Clean up all static routes defined.
> 
> 
> 
> 2.
> 
> a.   Export and update monmap using the messy way as described in
> http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-mons/
> 
> 
> 
> ceph mon getmap -o {tmp}/{filename}
> 
> monmaptool -rm node1 -rm node2 ... --rm node n {tmp}/{filename}
> 
> monmaptool -add node1 v2:10.0.18.1:3330,v1:10.0.18.1:6789 -add node2
> v2:10.0.18.2:3330,v1:10.0.18.2:6789 ... --add nodeN
> v2:10.0.18.N:3330,v1:10.0.18.N:6789  {tmp}/{filename}
> 
> 
> 
> b.   stop entire cluster daemons and change IP addresses
> 
> 
> c.   For each mon node:  ceph-mon -I {mon-id} -inject-monmap
> {tmp}/{filename}
> 
> 
> 
> d.   Restart cluster daemons.
> 
> 
> 
> 3.   Or any better method...
> =
> 
> Would anyone please help?   Thanks a lot.
> Rgds
> /st wong
> 



--
Manuel Lausch

Systemadministrator
Storage Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 |
76135 Karlsruhe | Germany Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de

Hauptsitz Montabaur, Amtsgericht Montabaur, HRB 5452

Managing directors: Alexander Charles, Thomas Ludwig, Jan Oetjen, Sascha Vollmer


Member of United Internet

This e-mail may contain confidential and/or legally protected information. If you
are not the intended recipient, or have received this e-mail in error, please
notify the sender and destroy this e-mail. Anyone other than the intended
recipient is prohibited from

Re: [ceph-users] Nautilus 14.2.1 / 14.2.2 crash

2019-07-24 Thread Alex Litvak
The only possible hint: the crash coincides with the start of a scrub time interval. Why it didn't happen yesterday at the same time, I have no idea. I returned to the default debug settings with the hope that I get a
little bit more info when the next crash happens. I really would like to debug only specific components rather than turning everything up to 20. Sorry for hijacking the post; I will create a new one
when I have more information.
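What I have in mind is raising the level only for a few subsystems on the affected daemon, roughly like this (a sketch; osd.12 and the chosen subsystems/levels are just examples):

ceph config set osd.12 debug_osd 10/10
ceph config set osd.12 debug_bluestore 10/10
ceph config set osd.12 debug_ms 1/5
# and back to defaults afterwards
ceph config rm osd.12 debug_osd
ceph config rm osd.12 debug_bluestore
ceph config rm osd.12 debug_ms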


On 7/23/2019 9:50 PM, Alex Litvak wrote:
I just had an OSD crash with no logs (debug was not enabled). It happened 24 hours after the actual upgrade from 14.2.1 to 14.2.2. Nothing else changed as far as environment or load. The disk is OK.
I restarted the OSD and it came back. The cluster had been up for 2 months until the upgrade without an issue.


On 7/23/2019 2:56 PM, Nathan Fish wrote:

I have not had any more OSDs crash, but the 3 that crashed still crash
on startup. I may purge and recreate them, but there's no hurry. I
have 18 OSDs per host and plenty of free space currently.

On Tue, Jul 23, 2019 at 2:19 AM Ashley Merrick  wrote:


Have they been stable since, or still had some crash?

,Thanks

 On Sat, 20 Jul 2019 10:09:08 +0800 Nigel Williams 
 wrote 


On Sat, 20 Jul 2019 at 04:28, Nathan Fish  wrote:

On further investigation, it seems to be this bug:
http://tracker.ceph.com/issues/38724


We just upgraded to 14.2.2, and had a dozen OSDs at 14.2.2 go down with this bug;
we recovered with:

systemctl reset-failed ceph-osd@160
systemctl start ceph-osd@160


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com