Hi all,

maybe a further alternative.

With our support contract I get exact replacements. I found out that doing an 
off-line copy of a still readable OSD with ddrescue speeds things up 
dramatically and avoids extended periods of degraded PGs.

Situation and what I did:

I had a disk with repeated deep-scrub errors, and checking with smartctl I could 
see that it had started remapping sectors. This showed up as PG scrub errors. I 
initiated a full deep scrub of the disk and ran PG repair on every PG that was 
marked as having errors. This way, ceph rewrites the broken object and the disk 
writes it to a remapped, that is, healthy sector. Doing this a couple of times 
will leave you with a disk that is 100% readable.
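
Roughly, the commands for this step are something like the following (the OSD 
ID, device name and PG ID are just placeholders):

    # check the SMART status of the suspect drive
    smartctl -a /dev/sdX

    # list the PGs reported with scrub errors
    ceph health detail

    # deep scrub the PGs on the suspect OSD
    ceph osd deep-scrub osd.12

    # repair each PG that comes back inconsistent, e.g.
    ceph pg repair 4.2f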

I then shut the OSD down. This led to recovery IO as expected, and after less 
than 2 hours everything was rebuilt to full redundancy (it was probably faster, 
I only checked after 2 hours). Recovery from a single-disk failure is very fast 
due to the all-to-all rebuild.
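
In commands this is something like (placeholders again; whether you mark the 
OSD out by hand or wait for the mon to do it after mon_osd_down_out_interval 
is a matter of taste):

    # stop the OSD daemon on its host (non-containerised deployment)
    systemctl stop ceph-osd@12

    # mark it out so rebuild to full redundancy starts right away
    ceph osd out osd.12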

In the meantime, I did a full disk copy with ddrescue to a large file system 
space I have on a copy station. This took 16h for a 12TB drive. Right after this, 
the replacement arrived and I copied the image back. Another 16h.
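
The ddrescue round trip is essentially (device names and paths are 
placeholders; the map file lets ddrescue resume and retry bad areas):

    # copy the failing disk to an image file on the copy station
    ddrescue /dev/sdX /copystation/osd-12.img /copystation/osd-12.map

    # when the replacement arrives, write the image onto the new disk
    ddrescue -f /copystation/osd-12.img /dev/sdY /copystation/osd-12-restore.map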

After this, I simply inserted the new disk with the 5-day-old OSD copy and 
brought it up (there was a weekend in between). Almost all objects on the drive 
were still up to date, and after just 30 minutes all PGs were active+clean. 
Nothing was remapped or misplaced any more.
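
Bringing the copied OSD back up is just the normal activation, something along 
these lines (placeholders again):

    # let ceph-volume discover and activate the OSD on the restored disk
    ceph-volume lvm activate --all

    # or, if it is already activated, just start the daemon
    systemctl start ceph-osd@12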

For comparison, I once added a single drive and it took 2 weeks for the 
affected PGs to be active+clean again. The off-line copy can use much more 
aggressive and effective IO to a single drive than ceph rebalancing ever would.

For single-disk exchange on our service contract I will probably continue with 
the ddrescue method even though it requires manual action.

For the future I plan to adopt a different strategy to utilize the all-to-all 
copy capability of ceph. Exchanging single disks does not seem to be a good way 
to run ceph. I will rather have a larger number of disks act as hot spares, for 
example, having enough spare capacity that one can tolerate losing 10% of all 
disks before replacing anything. Adding a large number of disks is overall more 
effective, as it will take basically the same time to get back to HEALTH_OK as 
exchanging a single disk.

With my timings, this "replace many disks, not single ones" approach will 
amortise once at least 5-6 drives have failed and are down+out. It will also 
limit writes to degraded PGs to the shortest interval possible.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Anthony D'Atri <[email protected]>
Sent: 28 November 2020 05:55:06
To: Tony Liu
Cc: [email protected]
Subject: [ceph-users] Re: replace osd with Octopus

>>
>
> Here is the context.
> https://docs.ceph.com/en/latest/mgr/orchestrator/#replace-an-osd
>
> When disk is broken,
> 1) orch osd rm <svc_id(s)> --replace [--force]
> 2) Replace disk.
> 3) ceph orch apply osd -i <osd_spec_file>
>
> Step #1 marks the OSD "destroyed". I assume it has the same effect as
> "ceph osd destroy", and that keeps the OSD "in", with no PG remapping, and
> the cluster in a "degraded" state.
>
> After step #3, the OSD will be "up" and "in", and data will be recovered
> back to the new disk. Is that right?

Yes.
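
For step #3 the spec file is the usual drive-group YAML; a minimal, purely 
illustrative example (the service_id and host pattern are made up, and 
`all: true` claims every eligible device on matching hosts) could look like:

    service_type: osd
    service_id: default_drive_group
    placement:
      host_pattern: '*'
    data_devices:
      all: true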

> Is the cluster "degraded" or "healthy" during such recovery?

It will be degraded, because there are fewer copies of some data available than 
during normal operation.  Clients will continue to access all data.

> For another option, the difference is no "--replace" in step #1.
> 1) orch osd rm <svc_id(s)> [--force]
> 2) Replace disk.
> 3) ceph orch apply osd -i <osd_spec_file>
>
> Step #1 evacuates PGs from the OSD and removes it from the cluster.
> If the disk is broken or the OSD daemon is down, is this evacuation still
> going to work?

Yes, of course — broken drives are the typical reason for removing OSDs.

> Is it going to take a while if there is a lot of data on this disk?

Yes, depending on what “a while” means to you, the size of the cluster, whether 
the pool is replicated or EC, and whether these are HDDs or SSDs.

> After step #3, PGs will be rebalanced/remapped again when the new OSD
> joins the cluster.
>
> I think, to replace with the same disk model, option #1 is preferred;
> to replace with a different disk model, it needs to be option #2.

I haven’t tried it under Octopus, but I don’t think this is strictly true.  If 
you replace it with a different model that is approximately the same size, 
everything will be fine.  Through Luminous and I think Nautilus at least, if 
you `destroy` and replace with a larger drive, the CRUSH weight of the OSD will 
still reflect that of the old drive.  You could then run `ceph osd crush 
reweight` after deploying to adjust the size.  You could record the CRUSH 
weights of all your drive models for initial OSD deploys, or you could `ceph 
osd tree` and look for another OSD of the same model, and set the CRUSH weight 
accordingly.
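
In practice that is just a lookup plus one command, something like (the IDs and 
the weight are placeholders; CRUSH weight is conventionally the size in TiB):

    # find the weight of an existing OSD of the same model
    ceph osd tree | grep osd.42

    # set the new OSD's CRUSH weight to match (e.g. ~10.9 for a 12 TB drive)
    ceph osd crush reweight osd.57 10.913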

If you replace with a smaller drive, your cluster will lose a small amount of 
usable capacity.  If you replace with a larger drive, the cluster may or may 
not enjoy a slight increase in capacity — that depends on replication strategy, 
rack/host weights, etc.

My personal philosophy on drive replacements:

o Build OSDs with `--dmcrypt` so that you don't have to worry about data if/when 
you RMA or recycle bad drives (see the sketch after this list).  RMAs are a 
hassle, so pick a value threshold below which a drive isn't worth the effort.  
This might be in the $250-500 range for example, which means that for many HDDs 
it isn't worth RMAing them.

o If you have an exact replacement, use it

o When buying spares, buy the largest size drive you have deployed — or will 
deploy within the next year or so.  That way you know that your spares can take 
the place of any drive you have, so you don’t have to maintain stock of more 
than one size. Worst case you don’t immediately make good use of that extra 
capacity, but you may in the future as drives in other failure domains fail and 
are replaced.  Be careful, though, of mixing drives that are a lot different in 
size.  Mixing 12 and 14 TB drives, or even 12 and 16, is usually no big deal, 
but if you mix, say, 1 TB and 16 TB drives, you can end up exceeding 
`mon_max_pg_per_osd`, which is one reason why I like to increase it from the 
default value to, say, 400.
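
Concretely, the dmcrypt and PG-limit bits might look like this (illustrative 
only; with cephadm you can, if I recall correctly, set `encrypted: true` in the 
OSD spec instead of calling ceph-volume by hand):

    # create an encrypted OSD on a given device with ceph-volume
    ceph-volume lvm create --data /dev/sdX --dmcrypt

    # raise the per-OSD PG limit from the default to 400
    ceph config set global mon_max_pg_per_osd 400
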
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]