Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-17 Thread Dan Jakubiec
Also worth pointing out something a bit obvious: this kind of 
faster/destructive migration should only be attempted if all of your pools are 
at least 3x replicated.

For example, if you had a 1x replicated pool you would lose data using this 
approach.
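
A quick way to sanity-check that before starting (standard Ceph CLI; "rbd" is
just an example pool name here):

  ceph osd pool ls detail        # lists every pool with its replicated "size"
  ceph osd pool get rbd size     # replica count of a single pool
  ceph osd pool get rbd min_size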

-- Dan

> On Jan 11, 2018, at 14:24, Reed Dier wrote:
> [...]

Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-17 Thread Jens-U. Mozdzen

Zitat von Jens-U. Mozdzen:
> [...]
> To me it seems like I'm hitting a bug outside of ceph-volume [...]


Just for the record (and the search engines), this was confirmed to be a  
bug, see http://tracker.ceph.com/issues/22673


Regards,
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-11 Thread Reed Dier
Thank you for documenting your progress and peril on the ML.

Luckily I only have 24x 8TB HDDs and 50x 1.92TB SSDs to migrate over to 
bluestore.

8 nodes, 4 chassis (failure domain), 3 drives per node for the HDDs, so I’m 
able to do about 3 at a time (1 node) for rip/replace.

Definitely taking it slow and steady, and the SSDs will move quickly for 
backfills as well.
Seeing about 1TB/6hr on backfills, without much of a performance hit on the rest of 
everything. With about 5TB average utilization on each 8TB disk, that works out to 
roughly 30 hours per disk (and a node's 3 disks backfill in parallel), so about 30 
hours-ish per host; times 8 hosts that will be about 10 days, so a couple of weeks is 
a safe amount of headway.
The write performance certainly seems better on bluestore than filestore, so 
that likely helps as well.

I expect I can probably refill an SSD OSD in about an hour or two, and will 
likely stagger those out.
But with such a small number of OSDs currently, I’m taking the by-hand 
approach rather than scripting it, so as to avoid similar pitfalls.

Reed 

> On Jan 11, 2018, at 12:38 PM, Brady Deetz wrote:
> 
> I hear you on time. I have 350 x 6TB drives to convert. I recently posted 
> about a disaster I created automating my migration. Good luck
> 
> [...]

Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-11 Thread Brady Deetz
I hear you on time. I have 350 x 6TB drives to convert. I recently posted
about a disaster I created automating my migration. Good luck

On Jan 11, 2018 12:22 PM, "Reed Dier" wrote:

> [...]

Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-11 Thread Reed Dier
I am in the process of migrating my OSDs to bluestore finally and thought I 
would give you some input on how I am approaching it.
Some of the saga you can find in another ML thread here: 
https://www.spinics.net/lists/ceph-users/msg41802.html 


For my first OSD I was cautious, and I outed the OSD without downing it, allowing 
it to move its data off.
Some background on my cluster: this OSD is an 8TB spinner, with an NVMe 
partition previously used for journaling in filestore, intended to be used for 
block.db in bluestore.

Then I downed it, flushed the journal, destroyed it, zapped with ceph-volume, 
set norecover and norebalance flags, did ceph osd crush remove osd.$ID, ceph 
auth del osd.$ID, and ceph osd rm osd.$ID and used ceph-volume locally to 
create the new LVM target. Then unset the norecover and norebalance flags and 
it backfilled like normal.
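
In rough commands, something like the following (a sketch, not a verbatim
transcript; $ID, /dev/sdX and the NVMe partition are placeholders):

  ceph osd out $ID                      # let the cluster drain the data off first
  # ... wait for the backfill to finish ...
  systemctl stop ceph-osd@$ID
  ceph-osd -i $ID --flush-journal       # flush the old filestore journal
  ceph osd set norecover
  ceph osd set norebalance
  ceph osd crush remove osd.$ID
  ceph auth del osd.$ID
  ceph osd rm osd.$ID
  ceph-volume lvm zap /dev/sdX
  ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1p1
  ceph osd unset norecover
  ceph osd unset norebalance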

I initially ran into issues with specifying --osd-id causing my OSDs to fail 
to start, but after removing that flag I was able to get it to fill in the gap of 
the OSD I just removed.

I’m now doing quicker, more destructive migrations in an attempt to reduce data 
movement.
This way I don’t read from the OSD I’m replacing, write to other OSDs temporarily, 
read back from the temp OSDs, and write back to the ‘new’ OSD.
I’m just reading from the replicas and writing to the ‘new’ OSD.

So I’m setting the norecover and norebalance flags, downing the OSD (but not out, 
it stays in; I also have the noout flag set), destroying/zapping, recreating using 
ceph-volume, then unsetting the flags, and it starts backfilling.
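
Roughly (a sketch; how the recreated OSD gets its old ID back is exactly what
the rest of this thread is about, so treat the ceph-volume step as illustrative,
and the device names are placeholders):

  ceph osd set noout
  ceph osd set norecover
  ceph osd set norebalance
  systemctl stop ceph-osd@$ID           # down, but it stays "in"
  ceph osd destroy $ID --yes-i-really-mean-it
  ceph-volume lvm zap /dev/sdX
  ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1p1
  ceph osd unset norecover
  ceph osd unset norebalance
  ceph osd unset noout
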
For 8TB disks, and with 23 other 8TB disks in the pool, it takes a long time to 
offload one and then backfill back from them. I trust my disks enough to 
backfill from the other disks, and it's going well. Also seeing very good write 
performance backfilling compared to previous drive replacements in filestore, 
so that's very promising.

Reed

> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen wrote:
> [...]

Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-10 Thread Jens-U. Mozdzen

Hi Alfredo,

thank you for your comments:

Zitat von Alfredo Deza:

On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen wrote:

Dear *,

has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
keeping the OSD number? There have been a number of messages on the list,
reporting problems, and my experience is the same. (Removing the existing
OSD and creating a new one does work for me.)

I'm working on a Ceph 12.2.2 cluster and tried following
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
- this basically says

1. destroy old OSD
2. zap the disk
3. prepare the new OSD
4. activate the new OSD

I never got step 4 to complete. The closest I got was by doing the following
steps (assuming OSD ID "999" on /dev/sdzz):

1. Stop the old OSD via systemd (osd-node # systemctl stop
ceph-osd@999.service)

2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)

3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
volume group

3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)

4. destroy the old OSD (osd-node # ceph osd destroy 999
--yes-i-really-mean-it)

5. create a new OSD entry (osd-node # ceph osd new $(cat
/var/lib/ceph/osd/ceph-999/fsid) 999)


Steps 5 and 6 are problematic if you are going to be trying ceph-volume
later on, which takes care of doing this for you.



6. add the OSD secret to Ceph authentication (osd-node # ceph auth add
osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i
/var/lib/ceph/osd/ceph-999/keyring)


I at first tried to follow the documented steps (without my steps 5  
and 6), which did not work for me. The documented approach failed with  
"init authentication failed: (1) Operation not permitted", because  
ceph-volume did not actually add the auth entry for me.


But even after manually adding the authentication, the "ceph-volume"  
approach failed, as the OSD was still marked "destroyed" in the osdmap  
epoch used by ceph-osd (see the commented messages from  
ceph-osd.999.log below).




7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore
--osd-id 999 --data /dev/sdzz)


You are going to hit a bug in ceph-volume that is preventing you from
specifying the osd id directly if the ID has been destroyed.

See http://tracker.ceph.com/issues/22642


If I read that bug description correctly, you're confirming why I  
needed step #6 above (manually adding the OSD auth entry). But even if  
ceph-volume had added it, the ceph-osd.log entries suggest that  
starting the OSD would still have failed, because of accessing the  
wrong osdmap epoch.


To me it seems like I'm hitting a bug outside of ceph-volume - unless  
it's ceph-volume that somehow determines which osdmap epoch is used by  
ceph-osd.



In order for this to work, you would need to make sure that the ID has
really been destroyed and avoid passing --osd-id in ceph-volume. The
caveat being that you will get whatever ID is available next in the cluster.


Yes, that's the work-around I then used - purge the old OSD and create  
a new one.
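
In concrete terms that boils down to something like this (same example device
as above; the new OSD simply gets whatever ID is free next):

osd-node # ceph osd purge 999 --yes-i-really-mean-it
osd-node # ceph-volume lvm zap /dev/sdzz
osd-node # ceph-volume lvm create --bluestore --data /dev/sdzz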


Thanks & regards,
Jens


[...]

Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-10 Thread Alfredo Deza
On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen wrote:
> Dear *,
>
> has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
> keeping the OSD number? There have been a number of messages on the list,
> reporting problems, and my experience is the same. (Removing the existing
> OSD and creating a new one does work for me.)
>
> I'm working on a Ceph 12.2.2 cluster and tried following
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
> - this basically says
>
> 1. destroy old OSD
> 2. zap the disk
> 3. prepare the new OSD
> 4. activate the new OSD
>
> I never got step 4 to complete. The closest I got was by doing the following
> steps (assuming OSD ID "999" on /dev/sdzz):
>
> 1. Stop the old OSD via systemd (osd-node # systemctl stop
> ceph-osd@999.service)
>
> 2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)
>
> 3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
> volume group
>
> 3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)
>
> 4. destroy the old OSD (osd-node # ceph osd destroy 999
> --yes-i-really-mean-it)
>
> 5. create a new OSD entry (osd-node # ceph osd new $(cat
> /var/lib/ceph/osd/ceph-999/fsid) 999)

Steps 5 and 6 are problematic if you are going to be trying ceph-volume
later on, which takes care of doing this for you.

>
> 6. add the OSD secret to Ceph authentication (osd-node # ceph auth add
> osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i
> /var/lib/ceph/osd/ceph-999/keyring)
>
> 7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore
> --osd-id 999 --data /dev/sdzz)

You are going to hit a bug in ceph-volume that is preventing you from
specifying the osd id directly if the ID has been destroyed.

See http://tracker.ceph.com/issues/22642

In order for this to work, you would need to make sure that the ID has
really been destroyed and avoid passing --osd-id in ceph-volume. The
caveat being that you will get whatever ID is available next in the cluster.

> [...]

[ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-10 Thread Jens-U. Mozdzen

Dear *,

has anybody been successful migrating Filestore OSDs to Bluestore  
OSDs, keeping the OSD number? There have been a number of messages on  
the list, reporting problems, and my experience is the same. (Removing  
the existing OSD and creating a new one does work for me.)


I'm working on a Ceph 12.2.2 cluster and tried following  
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd - this basically  
says


1. destroy old OSD
2. zap the disk
3. prepare the new OSD
4. activate the new OSD

I never got step 4 to complete. The closest I got was by doing the  
following steps (assuming OSD ID "999" on /dev/sdzz):


1. Stop the old OSD via systemd (osd-node # systemctl stop  
ceph-osd@999.service)


2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)

3a. if the old OSD was Bluestore with LVM, manually clean up the old  
OSD's volume group (an example of what that can look like follows after step 7)


3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)

4. destroy the old OSD (osd-node # ceph osd destroy 999  
--yes-i-really-mean-it)


5. create a new OSD entry (osd-node # ceph osd new $(cat  
/var/lib/ceph/osd/ceph-999/fsid) 999)


6. add the OSD secret to Ceph authentication (osd-node # ceph auth add  
osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd'  
-i /var/lib/ceph/osd/ceph-999/keyring)


7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore  
--osd-id 999 --data /dev/sdzz)
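
(Regarding step 3a: one way to find and remove the old volume group, assuming
the LVs were created by ceph-volume, which tags them with the OSD id; the names
in angle brackets are whatever the first command reports:)

osd-node # lvs -o lv_name,vg_name,lv_tags | grep 'ceph.osd_id=999'
osd-node # lvremove -f <vg_name>/<lv_name>
osd-node # vgremove -f <vg_name>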


but ceph-osd keeps complaining "osdmap says I am destroyed, exiting"  
on "osd-node # systemctl start ceph-osd@999.service".


At first I felt I was hitting http://tracker.ceph.com/issues/21023  
(BlueStore-OSDs marked as destroyed in OSD-map after v12.1.1 to  
v12.1.4 upgrade). But I was already using the "ceph osd new" command,  
which didn't help.


Some hours of sleep later I matched the issued commands to the osdmap  
changes and the ceph-osd log messages, which revealed something strange:


- from issuing "ceph osd destroy", osdmap lists the OSD as  
"autoout,destroyed,exists" (no surprise here)

- once I issued "ceph osd new", osdmap lists the OSD as "autoout,exists,new"
- starting ceph-osd after "ceph osd new" reports "osdmap says I am  
destroyed, exiting"


I can see in the ceph-osd log that it is relating to an *old* osdmap  
epoch, roughly 45 minutes old by then?
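
(For reference, the cluster's current epoch and the flags it records for the
OSD in a given epoch can be checked with something like the following, as long
as the mon has not trimmed the older map yet; 109892 and 110587 are the epochs
from the log below:)

osd-node # ceph osd stat                            # current osdmap epoch
osd-node # ceph osd dump | grep '^osd.999 '         # current flags for osd.999
osd-node # ceph osd dump 109892 | grep '^osd.999 '  # same, for the old epoch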


This got me curious and I dug through the OSD log file, checking the  
epoch numbers during start-up:


I took some detours, so there's more than two failed starts in the OSD  
log file ;) :


--- cut here ---
# first of multiple attempts, before "ceph auth add ..."
# no actual epoch referenced, as login failed due to missing auth
2018-01-10 00:00:02.173983 7f5cf1c89d00  0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for clients
2018-01-10 00:00:02.173990 7f5cf1c89d00  0 osd.999 0 crush map has features 288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-10 00:00:02.173994 7f5cf1c89d00  0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for osds
2018-01-10 00:00:02.174046 7f5cf1c89d00  0 osd.999 0 load_pgs
2018-01-10 00:00:02.174051 7f5cf1c89d00  0 osd.999 0 load_pgs opened 0 pgs
2018-01-10 00:00:02.174055 7f5cf1c89d00  0 osd.999 0 using weightedpriority op queue with priority op cut off at 64.
2018-01-10 00:00:02.174891 7f5cf1c89d00 -1 osd.999 0 log_to_monitors {default=true}
2018-01-10 00:00:02.177479 7f5cf1c89d00 -1 osd.999 0 init authentication failed: (1) Operation not permitted

# after "ceph auth ..."
# note the different epochs below? BTW, 110587 is the current epoch at that time and osd.999 is marked destroyed there
# 109892: much too old to offer any details
# 110587: modified 2018-01-09 23:43:13.202381

2018-01-10 00:08:00.945507 7fc55905bd00  0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for clients
2018-01-10 00:08:00.945514 7fc55905bd00  0 osd.999 0 crush map has features 288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-10 00:08:00.945521 7fc55905bd00  0 osd.999 0 crush map has features 288232575208783872, adjusting msgr requires for osds
2018-01-10 00:08:00.945588 7fc55905bd00  0 osd.999 0 load_pgs
2018-01-10 00:08:00.945594 7fc55905bd00  0 osd.999 0 load_pgs opened 0 pgs
2018-01-10 00:08:00.945599 7fc55905bd00  0 osd.999 0 using weightedpriority op queue with priority op cut off at 64.
2018-01-10 00:08:00.946544 7fc55905bd00 -1 osd.999 0 log_to_monitors {default=true}
2018-01-10 00:08:00.951720 7fc55905bd00  0 osd.999 0 done with init, starting boot process
2018-01-10 00:08:00.952225 7fc54160a700 -1 osd.999 0 waiting for initial osdmap
2018-01-10 00:08:00.970644 7fc546614700  0 osd.999 109892 crush map has features 288232610642264064, adjusting msgr requires for clients
2018-01-10 00:08:00.970653 7fc546614700  0 osd.999 109892 crush map has features 288232610642264064