[ceph-users] Re: Ceph recovery

2023-05-01 Thread wodel youchi
Thank you for the clarification.

On Mon, May 1, 2023, 20:11 Wesley Dillingham  wrote:

> Assuming size=3 and min_size=2, it will run degraded (read/write capable)
> until a third host becomes available, at which point it will backfill the
> third copy on the third host. It will be unable to create the third copy of
> data if no third host exists. If an additional host is lost, the data will
> become inactive+degraded (below min_size) and will be unavailable for use.
> Data will not be lost, though, assuming no further failures occur beyond the
> two hosts already down, and if the second and third hosts come back the data
> will recover. It is always best to have an additional host beyond the size
> setting for this reason.
>
> Respectfully,
>
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn 
>
>
> On Mon, May 1, 2023 at 11:34 AM wodel youchi 
> wrote:
>
>> Hi,
>>
>> When creating a Ceph cluster, a failure domain is defined, and by default
>> it uses host as the minimal domain; that domain can be changed to chassis,
>> rack, etc.
>>
>> My question is:
>> Suppose I have three OSD nodes, my replication is 3, and my failure domain
>> is host, which means that each copy of the data is stored on a different node.
>>
>> What happens when one node crashes? Does Ceph use the remaining free space
>> on the other two to create the third copy, or will the cluster run in
>> degraded mode, like a RAID 5 array that has lost a disk?
>>
>> Regards.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery

2023-05-01 Thread Wesley Dillingham
Assuming size=3 and min_size=2, it will run degraded (read/write capable)
until a third host becomes available, at which point it will backfill the
third copy on the third host. It will be unable to create the third copy of
data if no third host exists. If an additional host is lost, the data will
become inactive+degraded (below min_size) and will be unavailable for use.
Data will not be lost, though, assuming no further failures occur beyond the
two hosts already down, and if the second and third hosts come back the data
will recover. It is always best to have an additional host beyond the size
setting for this reason.
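
For reference, the thresholds Wes describes and the failure domain wodel asks
about can be inspected with the standard CLI; a minimal sketch, with "mypool"
standing in for a real pool name:

ceph osd pool get mypool size       # number of replicas kept (3 in this example)
ceph osd pool get mypool min_size   # I/O pauses when fewer copies than this are available
ceph osd tree                       # shows the host layout CRUSH places replicas across
ceph osd crush rule dump            # confirms the failure domain (host, chassis, rack, ...)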

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Mon, May 1, 2023 at 11:34 AM wodel youchi  wrote:

> Hi,
>
> When creating a Ceph cluster, a failure domain is defined, and by default
> it uses host as the minimal domain; that domain can be changed to chassis,
> rack, etc.
>
> My question is:
> Suppose I have three OSD nodes, my replication is 3, and my failure domain
> is host, which means that each copy of the data is stored on a different node.
>
> What happens when one node crashes? Does Ceph use the remaining free space
> on the other two to create the third copy, or will the cluster run in
> degraded mode, like a RAID 5 array that has lost a disk?
>
> Regards.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-07-01 Thread Curt
On Wed, Jun 29, 2022 at 11:22 PM Curt  wrote:

>
>
> On Wed, Jun 29, 2022 at 9:55 PM Stefan Kooman  wrote:
>
>> On 6/29/22 19:34, Curt wrote:
>> > Hi Stefan,
>> >
>> > Thank you, that definitely helped. I bumped it to 20% for now and
>> that's
>> > giving me around 124 PGs backfilling at 187 MiB/s, 47 Objects/s.  I'll
>> > see how that runs and then increase it a bit more if the cluster
>> handles
>> > it ok.
>> >
>> > Do you think it's worth enabling scrubbing while backfilling?
>>
>> If the cluster can cope with the extra load, sure. If it slows down the
>> backfilling to levels that are too slow ... temporarily disable it.
>>
>> Since
>> > this is going to take a while. I do have 1 inconsistent PG that has now
>> > become 10 as it splits.
>>
>> Hmm. Well, if it finds broken PGs, for sure pause backfilling (ceph osd
>> set nobackfill) and have it handle this ASAP: ceph pg repair $pg.
>> Something is wrong, and you want to have this fixed sooner rather than
>> later.
>>
>
>  When I try to run a repair nothing happens, if I try to list
> inconsistent-obj I get No scrub information available for 12.12.  If I tell
> it to run a deep scrub, nothing.  I'll set debug and see what I can find in
> the logs.
>
Just to give a quick update: this one was my fault, I missed a flag. Once it
was set correctly, it scrubbed and repaired. It's now back to adding more
PGs, which continues to get a bit faster as it expands. I'm now up to pg_num
1362 and pgp_num 1234, with backfills happening at 250-300 MiB/s and 60-70
objects/s.
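
For readers following along, the scrub-and-repair sequence for one of the
damaged PGs in this thread would look roughly like the sketch below (the exact
flag that had been missed is not named in the thread, so treat this as
illustrative only):

ceph pg deep-scrub 12.12                                # re-scrub so fresh inconsistency info exists
rados list-inconsistent-obj 12.12 --format=json-pretty  # inspect what the scrub found
ceph pg repair 12.12                                    # ask the primary OSD to repair the PG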

Thanks for all the help.

>
>> Not sure what hardware you have, but you might benefit from disabling
>> write caches, see this link:
>>
>> https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches
>>
Thanks, I'm disabling cache and I'll see if it helps at all.
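
The linked hardware-recommendations page is about disabling the drives'
volatile write cache; on SATA drives one common way to do that is with hdparm.
A sketch only, with a placeholder device name, and the change may need a udev
rule or similar to persist across reboots:

hdparm -W 0 /dev/sdb   # disable the volatile write cache
hdparm -W /dev/sdb     # query the current write-cache setting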

> Gr. Stefan
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-06-29 Thread Curt
On Wed, Jun 29, 2022 at 9:55 PM Stefan Kooman  wrote:

> On 6/29/22 19:34, Curt wrote:
> > Hi Stefan,
> >
> > Thank you, that definitely helped. I bumped it to 20% for now and that's
> > giving me around 124 PGs backfilling at 187 MiB/s, 47 Objects/s.  I'll
> > see how that runs and then increase it a bit more if the cluster handles
> > it ok.
> >
> > Do you think it's worth enabling scrubbing while backfilling?
>
> If the cluster can cope with the extra load, sure. If it slows down the
> backfilling to levels that are too slow ... temporarily disable it.
>
> Since
> > this is going to take a while. I do have 1 inconsistent PG that has now
> > become 10 as it splits.
>
> Hmm. Well, if it finds broken PGs, for sure pause backfilling (ceph osd
> set nobackfill) and have it handle this ASAP: ceph pg repair $pg.
> Something is wrong, and you want to have this fixed sooner rather than
> later.
>

When I try to run a repair, nothing happens; if I try to list
inconsistent-obj, I get "No scrub information available" for 12.12.  If I tell
it to run a deep scrub, nothing happens either.  I'll set debug and see what I
can find in the logs.
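
As a quick first check before turning up debug logging, the flags that
commonly prevent scrubs and repairs from being scheduled can be inspected like
this (a sketch; the pool name is taken from the thread):

ceph osd dump | grep flags            # cluster-wide flags such as noscrub/nodeep-scrub
ceph osd pool get EC-22-Pool noscrub
ceph osd pool get EC-22-Pool nodeep-scrub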

>
> Not sure what hardware you have, but you might benefit from disabling
> write caches, see this link:
>
> https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches
>
> Gr. Stefan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-06-29 Thread Curt
Hi Stefan,

Thank you, that definitely helped. I bumped it to 20% for now and that's
giving me around 124 PGs backfilling at 187 MiB/s, 47 Objects/s.  I'll see
how that runs and then increase it a bit more if the cluster handles it ok.

Do you think it's worth enabling scrubbing while backfilling, since this is
going to take a while? I do have 1 inconsistent PG that has now become 10 as
it splits.

ceph health detail
HEALTH_ERR 21 scrub errors; Possible data damage: 10 pgs inconsistent; 2
pgs not deep-scrubbed in time
[ERR] OSD_SCRUB_ERRORS: 21 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 10 pgs inconsistent
pg 12.12 is active+clean+inconsistent, acting [28,1,37,0]
pg 12.32 is active+clean+inconsistent, acting [37,3,14,22]
pg 12.52 is active+clean+inconsistent, acting [4,33,7,23]
pg 12.72 is active+remapped+inconsistent+backfilling, acting
[37,3,14,22]
pg 12.92 is active+remapped+inconsistent+backfilling, acting [28,1,37,0]
pg 12.b2 is active+remapped+inconsistent+backfilling, acting
[37,3,14,22]
pg 12.d2 is active+clean+inconsistent, acting [4,33,7,23]
pg 12.f2 is active+remapped+inconsistent+backfilling, acting
[37,3,14,22]
pg 12.112 is active+clean+inconsistent, acting [28,1,37,0]
pg 12.132 is active+clean+inconsistent, acting [37,3,14,22]
[WRN] PG_NOT_DEEP_SCRUBBED: 2 pgs not deep-scrubbed in time
pg 4.13 not deep-scrubbed since 2022-06-16T03:15:16.758943+
pg 7.1 not deep-scrubbed since 2022-06-16T20:51:12.211259+
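
Stefan's suggestion elsewhere in this thread (pause backfill, then repair the
damaged PGs right away) would translate to something like the following for
the PGs listed above; a sketch, not a sequence taken from the thread, and a
repair will only run once scrub information for the PG exists:

ceph osd set nobackfill
for pg in 12.12 12.32 12.52 12.72 12.92 12.b2 12.d2 12.f2 12.112 12.132; do
    ceph pg repair "$pg"    # queue a repair; watch progress with 'ceph -s'
done
ceph osd unset nobackfill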

Thanks,
Curt

On Wed, Jun 29, 2022 at 5:53 PM Stefan Kooman  wrote:

> On 6/29/22 15:14, Curt wrote:
>
>
> >
> > Hi Stefan,
> >
> > Good to know.  I see the default of 0.05 for misplaced_ratio.  What do
> > you recommend would be a safe number to increase it to?
>
> It depends. It might be safe to put it to 1. But I would slowly increase
> it, have the manager increase pgp_num and see how the cluster copes with
> the increased load. If you have hardly any client workload you might
> bump this ratio quite a bit. At some point you would need to increase
> osd max backfill to avoid having PGs waiting on backfill.
>
> Gr. Stefan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-06-29 Thread Curt
On Wed, Jun 29, 2022 at 4:42 PM Stefan Kooman  wrote:

> On 6/29/22 11:21, Curt wrote:
> > On Wed, Jun 29, 2022 at 1:06 PM Frank Schilder  wrote:
> >
> >> Hi,
> >>
> >> did you wait for PG creation and peering to finish after setting pg_num
> >> and pgp_num? They should be right on the value you set and not lower.
> >>
> > Yes, only thing going on was backfill. It's still just slowly expanding
> pg
> > and pgp nums.   I even ran the set command again.  Here's the current
> info
> > ceph osd pool get EC-22-Pool all
> > size: 4
> > min_size: 3
> > pg_num: 226
> > pgp_num: 98
>
> This is coded in the mons and works like that from nautilus onwards:
>
> src/mon/OSDMonitor.cc
>
> ...
>  if (osdmap.require_osd_release < ceph_release_t::nautilus) {
>// pre-nautilus osdmap format; increase pg_num directly
>assert(n > (int)p.get_pg_num());
>// force pre-nautilus clients to resend their ops, since they
>// don't understand pg_num_target changes form a new interval
>p.last_force_op_resend_prenautilus = pending_inc.epoch;
>// force pre-luminous clients to resend their ops, since they
>// don't understand that split PGs now form a new interval.
>p.last_force_op_resend_preluminous = pending_inc.epoch;
>p.set_pg_num(n);
>  } else {
>// set targets; mgr will adjust pg_num_actual and pgp_num later.
>// make pgp_num track pg_num if it already matches.  if it is set
>// differently, leave it different and let the user control it
>// manually.
>if (p.get_pg_num_target() == p.get_pgp_num_target()) {
>  p.set_pgp_num_target(n);
>}
>p.set_pg_num_target(n);
>  }
> ...
>
> So, when pg_num and pgp_num are the same when pg_num is increased, it
> will slowly change pgp_num. If pgp_num is different (smaller, as it
> cannot be bigger than pg_num) it will not touch pgp_num.
>
> You might speed up this process by increasing "target_max_misplaced_ratio"
>
> Gr. Stefan
>

Hi Stefan,

Good to know.  I see the default of 0.05 for misplaced_ratio.  What do you
recommend would be a safe number to increase it to?
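
Stefan's reply (raise it slowly and watch how the cluster copes) maps to
something like the following; the value shown is purely illustrative, not a
recommendation from the thread:

ceph config set mgr target_max_misplaced_ratio 0.10   # allow up to 10% misplaced objects
ceph osd pool get EC-22-Pool pgp_num                  # watch the mgr ramp pgp_num toward pg_num
ceph -s                                               # keep an eye on client vs. recovery I/O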

Thanks,
Curt
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-06-29 Thread Curt
On Wed, Jun 29, 2022 at 1:06 PM Frank Schilder  wrote:

> Hi,
>
> did you wait for PG creation and peering to finish after setting pg_num
> and pgp_num? They should be right on the value you set and not lower.
>
Yes, the only thing going on was backfill. It's still just slowly expanding
pg and pgp nums.  I even ran the set command again.  Here's the current info:
ceph osd pool get EC-22-Pool all
size: 4
min_size: 3
pg_num: 226
pgp_num: 98
crush_rule: EC-22-Pool
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: EC-22-Pro
fast_read: 0
pg_autoscale_mode: off
eio: false
bulk: false

>
> > How do you set the upmap balancer per pool?
>
> I'm afraid the answer is RTFM. I don't use it, but I believe to remember
> one could configure it for equi-distribution of PGs for each pool.
>
Ok, I'll dig around some more. I glanced at the balancer page and didn't
see it.


> Whenever you grow the cluster, you should make the same considerations
> again and select numbers of PG per pool depending on number of objects,
> capacity and performance.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Curt 
> Sent: 28 June 2022 16:33:24
> To: Frank Schilder
> Cc: Robert Gallop; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: Ceph recovery network speed
>
> Hi Frank,
>
> Thank you for the thorough breakdown. I have increased the pg_num and
> pgp_num to 1024 to start on the ec-22 pool. That is going to be my primary
> pool with the most data.  It looks like ceph slowly scales the pg up even
> with autoscaling off, since I see target_pg_num 2048, pg_num 199.
>
> root@cephmgr:/# ceph osd pool set EC-22-Pool pg_num 2048
> set pool 12 pg_num to 2048
> root@cephmgr:/# ceph osd pool set EC-22-Pool pgp_num 2048
> set pool 12 pgp_num to 2048
> root@cephmgr:/# ceph osd pool get EC-22-Pool all
> size: 4
> min_size: 3
> pg_num: 199
> pgp_num: 71
> crush_rule: EC-22-Pool
> hashpspool: true
> allow_ec_overwrites: true
> nodelete: false
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> use_gmt_hitset: 1
> erasure_code_profile: EC-22-Pro
> fast_read: 0
> pg_autoscale_mode: off
> eio: false
> bulk: false
>
> This cluster will be growing quite a bit over the next few months.  I am
> migrating data from their old Giant cluster to a new one, by the time I'm
> done it should be 16 hosts with about 400TB of data. I'm guessing I'll have
> to increase pg again later when I start adding more servers to the cluster.
>
> I will look into if SSD's are an option.  How do you set the upmap
> balancer per pool?  Looking at ceph balancer status my mode is already
> upmap.
>
> Thanks again,
> Curt
>
> On Tue, Jun 28, 2022 at 1:23 AM Frank Schilder  wrote:
> Hi Curt,
>
> looking at what you sent here, I believe you are the victim of "the law of
> large numbers really only holds for large numbers". In other words, the
> statistics of small samples is biting you. The PG numbers of your pools are
> so low that they lead to a very large imbalance of data- and IO placement.
> In other words, in your cluster a few OSDs receive the majority of IO
> requests and bottleneck the entire cluster.
>
> If I see this correctly, the PG num per drive varies from 14 to 40. That's
> an insane imbalance. Also, on your EC pool PG_num is 128 but PGP_num is
> only 48. The autoscaler is screwing it up for you. It will slowly increase
> the number of active PGs, causing continuous relocation of objects for a
> very long time.
>
> I think the recovery speed you see for 8 objects per second is not too bad
> considering that you have an HDD only cluster. The speed does not increase,
> because it is a small number of PGs sending data - a subset of the 32 you
> had before. In addition, due to the imbalance of PGs per OSD, only a small
> number of PGs will be able to send data. You will need patience to get out
> of this corner.
>
> The first thing I would do is look at which pools are important for your
> workload in the long run. I see 2 pools having a significant number of
> objects: EC-22-Pool and default.rgw.buckets.data. EC-22-Pool has about 40
> times the number of objects and bytes as default.rgw.buckets.data. I would
> scale both up in PG count with emphasis on EC-22-Pool.
>
> Your cluster can safely operate between 1100 and 2200 PGs with replication
> <=4. If you don't plan to create more large pools, a good choice of
> distributin

[ceph-users] Re: Ceph recovery network speed

2022-06-28 Thread Curt
Hi Frank,

Thank you for the thorough breakdown. I have increased the pg_num and
pgp_num to 1024 to start on the ec-22 pool. That is going to be my primary
pool with the most data.  It looks like ceph slowly scales the pg up even
with autoscaling off, since I see target_pg_num 2048, pg_num 199.

root@cephmgr:/# ceph osd pool set EC-22-Pool pg_num 2048
set pool 12 pg_num to 2048
root@cephmgr:/# ceph osd pool set EC-22-Pool pgp_num 2048
set pool 12 pgp_num to 2048
root@cephmgr:/# ceph osd pool get EC-22-Pool all
size: 4
min_size: 3
pg_num: 199
pgp_num: 71
crush_rule: EC-22-Pool
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: EC-22-Pro
fast_read: 0
pg_autoscale_mode: off
eio: false
bulk: false
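
The slow ramp described above is by design on Nautilus and later: the set
command records a target, and the mgr walks pg_num/pgp_num toward it (Stefan
quotes the relevant OSDMonitor code elsewhere in this thread). Both the
current and target values can be watched with standard commands, for example:

ceph osd pool ls detail | grep EC-22-Pool   # shows pg_num/pgp_num and their *_target values
ceph osd pool get EC-22-Pool pgp_num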

This cluster will be growing quite a bit over the next few months.  I am
migrating data from their old Giant cluster to a new one, by the time I'm
done it should be 16 hosts with about 400TB of data. I'm guessing I'll have
to increase pg again later when I start adding more servers to the cluster.

I will look into if SSD's are an option.  How do you set the upmap balancer
per pool?  Looking at ceph balancer status my mode is already upmap.
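
The per-pool balancer question doesn't get a direct answer in the thread. One
way to point the upmap balancer at a single pool is to evaluate and optimize
that pool explicitly through the balancer module; a sketch, using the pool
name from this thread and an arbitrary plan name:

ceph balancer eval EC-22-Pool              # score the PG distribution of just this pool
ceph balancer optimize myplan EC-22-Pool   # build an upmap plan limited to this pool
ceph balancer show myplan                  # review the proposed changes
ceph balancer execute myplan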

Thanks again,
Curt

On Tue, Jun 28, 2022 at 1:23 AM Frank Schilder  wrote:

> Hi Curt,
>
> looking at what you sent here, I believe you are the victim of "the law of
> large numbers really only holds for large numbers". In other words, the
> statistics of small samples is biting you. The PG numbers of your pools are
> so low that they lead to a very large imbalance of data- and IO placement.
> In other words, in your cluster a few OSDs receive the majority of IO
> requests and bottleneck the entire cluster.
>
> If I see this correctly, the PG num per drive varies from 14 to 40. That's
> an insane imbalance. Also, on your EC pool PG_num is 128 but PGP_num is
> only 48. The autoscaler is screwing it up for you. It will slowly increase
> the number of active PGs, causing continuous relocation of objects for a
> very long time.
>
> I think the recovery speed you see for 8 objects per second is not too bad
> considering that you have an HDD only cluster. The speed does not increase,
> because it is a small number of PGs sending data - a subset of the 32 you
> had before. In addition, due to the imbalance of PGs per OSD, only a small
> number of PGs will be able to send data. You will need patience to get out
> of this corner.
>
> The first thing I would do is look at which pools are important for your
> workload in the long run. I see 2 pools having a significant number of
> objects: EC-22-Pool and default.rgw.buckets.data. EC-22-Pool has about 40
> times the number of objects and bytes as default.rgw.buckets.data. I would
> scale both up in PG count with emphasis on EC-22-Pool.
>
> Your cluster can safely operate between 1100 and 2200 PGs with replication
> <=4. If you don't plan to create more large pools, a good choice of
> distributing this capacity might be
>
> EC-22-Pool: 1024 PGs (could be pushed up to 2048)
> default.rgw.buckets.data: 256 PGs
>
> That's towards the lower end of available PGs. Please make your own
> calculation and judgement.
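
(For context, the 1100-2200 range follows from the usual rule of thumb of
roughly 100-200 PG replicas per OSD: with the 44 OSDs shown earlier in the
thread and size 4, that is 44 x 100 / 4 ~= 1100 up to 44 x 200 / 4 = 2200 PGs
across all pools. A back-of-the-envelope check, not a hard rule.)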
>
> If you have settled on target numbers, change the pool sizes in one go,
> that is, set PG_num and PGP_num to the same value right away. You might
> need to turn autoscaler off for these 2 pools. The rebalancing will take a
> long time and also not speed up, because the few sending PGs are the
> bottleneck, not the receiving ones. You will have to sit it out.
>
> The goal is that, in the future, recovery and re-balancing are improved.
> In my experience, a reasonably high PG count will also reduce latency of
> client IO.
>
> Next thing to look at is distribution of PGs per OSD. This has an enormous
> performance impact, because a few too busy OSDs can throttle an entire
> cluster (its always the slowest disk that wins). I use the very simple
> reweight by utilization method, but my pools do not share OSDs as yours do.
> You might want to try the upmap balancer per pool to get PGs per pool
> evenly spread out over OSDs.
>
> Lastly, if you can afford it and your hosts have a slot left, consider
> buying one enterprise SSD per host for the meta-data pools to get this IO
> away from the HDDs. If you buy a bunch of 128G or 256G SATA SSDs, you can
> probably place everything except the EC-22-Pool on these drives, separating
> completely.
>
> Hope that helps and maybe someone else has ideas as well?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Curt 
> Sent: 27 Ju

[ceph-users] Re: Ceph recovery network speed

2022-06-27 Thread Curt
   298 GiB5 KiB  1.4
GiB  1.5 TiB  16.05  0.61   26  up  osd.23
24hdd   1.81940   1.0  1.8 TiB   735 GiB   733 GiB8 KiB  2.3
GiB  1.1 TiB  39.45  1.50   33  up  osd.24
25hdd   1.81940   1.0  1.8 TiB   519 GiB   517 GiB5 KiB  1.4
GiB  1.3 TiB  27.85  1.06   26  up  osd.25
26hdd   1.81940   1.0  1.8 TiB   483 GiB   481 GiB  614 KiB  1.7
GiB  1.3 TiB  25.94  0.99   28  up  osd.26
27hdd   1.81940   1.0  1.8 TiB   226 GiB   225 GiB  1.5 MiB  1.0
GiB  1.6 TiB  12.11  0.46   17  up  osd.27
28hdd   1.81940   1.0  1.8 TiB   443 GiB   441 GiB   24 KiB  1.5
GiB  1.4 TiB  23.76  0.91   21  up  osd.28
29hdd   1.81940   1.0  1.8 TiB   801 GiB   799 GiB7 KiB  2.2
GiB  1.0 TiB  42.98  1.64   31  up  osd.29
30hdd   1.81940   1.0  1.8 TiB   523 GiB   522 GiB  174 KiB  1.2
GiB  1.3 TiB  28.09  1.07   29  up  osd.30
31hdd   1.81940   1.0  1.8 TiB   322 GiB   321 GiB4 KiB  1.2
GiB  1.5 TiB  17.30  0.66   26  up  osd.31
44hdd   1.81940   1.0  1.8 TiB   541 GiB   540 GiB  136 KiB  1.4
GiB  1.3 TiB  29.06  1.11   24  up  osd.44
-9 20.01337 -   20 TiB   5.3 TiB   5.2 TiB   25 MiB   16
GiB   15 TiB  26.25  1.00-  host hyperion04
33hdd   1.81940   1.0  1.8 TiB   466 GiB   465 GiB  469 KiB  1.4
GiB  1.4 TiB  25.02  0.95   28  up  osd.33
34hdd   1.81940   1.0  1.8 TiB   508 GiB   506 GiB2 KiB  1.8
GiB  1.3 TiB  27.28  1.04   30  up  osd.34
35hdd   1.81940   1.0  1.8 TiB   521 GiB   520 GiB2 KiB  1.4
GiB  1.3 TiB  27.98  1.07   32  up  osd.35
36hdd   1.81940   1.0  1.8 TiB   872 GiB   870 GiB3 KiB  2.3
GiB  991 GiB  46.81  1.78   40  up  osd.36
37hdd   1.81940   1.0  1.8 TiB   443 GiB   441 GiB  136 KiB  1.2
GiB  1.4 TiB  23.75  0.91   25  up  osd.37
38hdd   1.81940   1.0  1.8 TiB   138 GiB   137 GiB   24 MiB  647
MiB  1.7 TiB   7.40  0.28   27  up  osd.38
39hdd   1.81940   1.0  1.8 TiB   638 GiB   637 GiB  622 KiB  1.7
GiB  1.2 TiB  34.26  1.31   33  up  osd.39
40hdd   1.81940   1.0  1.8 TiB   444 GiB   443 GiB   14 KiB  1.4
GiB  1.4 TiB  23.85  0.91   25  up  osd.40
41hdd   1.81940   1.0  1.8 TiB   477 GiB   476 GiB  264 KiB  1.3
GiB  1.4 TiB  25.60  0.98   31  up  osd.41
42hdd   1.81940   1.0  1.8 TiB   514 GiB   513 GiB   35 KiB  1.2
GiB  1.3 TiB  27.61  1.05   29  up  osd.42
43hdd   1.81940   1.0  1.8 TiB   358 GiB   356 GiB  111 KiB  1.2
GiB  1.5 TiB  19.19  0.73   24  up  osd.43
TOTAL   80 TiB21 TiB21 TiB   32 MiB   69
GiB   59 TiB  26.23
MIN/MAX VAR: 0.12/2.36  STDDEV: 12.47

>
> The number of objects in flight looks small. Your objects seem to have an
> average size of 4MB and should recover with full bandwidth. Check with top
> how much IO wait percentage you have on the OSD hosts.
>
iowait is 3.3% and load avg is 3.7, nothing crazy from what I can tell.


>
> The one thing that jumps to my eye though is, that you only have 22 dirty
> PGs and they are all recovering/backfilling already. I wonder if you have a
> problem with your crush rules, they might not do what you think they do.
> You said you increased the PG count for EC-22-Pool to 128 (from what?) but
> it doesn't really look like a suitable number of PGs has been marked for
> backfilling. Can you post the output of "ceph osd pool get EC-22-Pool all"?
>
From 32 to 128
ceph osd pool get EC-22-Pool all
size: 4
min_size: 3
pg_num: 128
pgp_num: 48
crush_rule: EC-22-Pool
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: EC-22-Pro
fast_read: 0
pg_autoscale_mode: on
eio: false
bulk: false



>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Curt 
> Sent: 27 June 2022 19:41:06
> To: Robert Gallop
> Cc: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: Ceph recovery network speed
>
> I would love to see those types of speeds. I tried setting it all the way
> to 0 and nothing, I did that before I sent the first email, maybe it was
> your old post I got it from.
>
> osd_recovery_sleep_hdd   0.00
>
>
>  override  (mon[0.00])
>
> On Mon, Jun 27, 2022 at 9:27 PM Robert Gallop  wrote:
> I saw a major boost after having the sleep_hdd set to 0.  Only after that
> did I start staying at around 50

[ceph-users] Re: Ceph recovery network speed

2022-06-27 Thread Curt
if you have a 10G
>> > line. However, recovery is something completely different from a full
>> > link-speed copy.
>> >
>> > I can tell you that boatloads of tiny objects are a huge pain for
>> > recovery, even on SSD. Ceph doesn't raid up sections of disks against
>> each
>> > other, but object for object. This might be a feature request: that PG
>> > space allocation and recovery should follow the model of LVM extends
>> > (ideally match with LVM extends) to allow recovery/rebalancing larger
>> > chunks of storage in one go, containing parts of a large or many small
>> > objects.
>> >
>> > Best regards,
>> > =
>> > Frank Schilder
>> > AIT Risø Campus
>> > Bygning 109, rum S14
>> >
>> > 
>> > From: Curt 
>> > Sent: 27 June 2022 17:35:19
>> > To: Frank Schilder
>> > Cc: ceph-users@ceph.io
>> > Subject: Re: [ceph-users] Re: Ceph recovery network speed
>> >
>> > Hello,
>> >
>> > I had already increased/changed those variables previously.  I increased
>> > the pg_num to 128. Which increased the number of PG's backfilling, but
>> > speed is still only at 30 MiB/s avg and has been backfilling 23 pg for
>> the
>> > last several hours.  Should I increase it higher than 128?
>> >
>> > I'm still trying to figure out if this is just how ceph is or if there
>> is
>> > a bottleneck somewhere.  Like if I sftp a 10G file between servers it's
>> > done in a couple min or less.  Am I thinking of this wrong?
>> >
>> > Thanks,
>> > Curt
>> >
>> > On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder  wrote:
>> > Hi Curt,
>> >
>> > as far as I understood, a 2+2 EC pool is recovering, which makes 1 OSD
>> per
>> > host busy. My experience is, that the algorithm for selecting PGs to
>> > backfill/recover is not very smart. It could simply be that it doesn't
>> find
>> > more PGs without violating some of these settings:
>> >
>> > osd_max_backfills
>> > osd_recovery_max_active
>> >
>> > I have never observed the second parameter to change anything (try any
>> > ways). However, the first one has a large impact. You could try
>> increasing
>> > this slowly until recovery moves faster. Another parameter you might
>> want
>> > to try is
>> >
>> > osd_recovery_sleep_[hdd|ssd]
>> >
>> > Be careful as this will impact client IO. I could reduce the sleep for
>> my
>> > HDDs to 0.05. With your workload pattern, this might be something you
>> can
>> > tune as well.
>> >
>> > Having said that, I think you should increase your PG count on the EC
>> pool
>> > as soon as the cluster is healthy. You have only about 20 PGs per OSD
>> and
>> > large PGs will take unnecessarily long to recover. A higher PG count
>> will
>> > also make it easier for the scheduler to find PGs for recovery/backfill.
>> > Aim for a number between 100 and 200. Give the pool(s) with most data
>> > (#objects) the most PGs.
>> >
>> > Best regards,
>> > =
>> > Frank Schilder
>> > AIT Risø Campus
>> > Bygning 109, rum S14
>> >
>> > 
>> > From: Curt 
>> > Sent: 24 June 2022 19:04
>> > To: Anthony D'Atri; ceph-users@ceph.io
>> > Subject: [ceph-users] Re: Ceph recovery network speed
>> >
>> > 2 PG's shouldn't take hours to backfill in my opinion.  Just 2TB
>> enterprise
>> > HD's.
>> >
>> > Take this log entry below, 72 minutes and still backfilling undersized?
>> > Should it be that slow?
>> >
>> > pg 12.15 is stuck undersized for 72m, current state
>> > active+undersized+degraded+remapped+backfilling, last acting
>> > [34,10,29,NONE]
>> >
>> > Thanks,
>> > Curt
>> >
>> >
>> > On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri 
>> > wrote:
>> >
>> > > Your recovery is slow *because* there are only 2 PGs backfilling.
>> > >
>> > > What kind of OSD media are you using?
>> > >
>> > > > On Jun 24, 2022, at 09:46, Curt  wrote:
>> > > >
>> > > > Hello,
>> > > >
>> > > > I'm trying to understand why my recovery is so slow with only 2 pg
>> > > > backfilling.  I'm only getting speeds of 3-4/MiB/s on a 10G
>> network.  I
>> > > > have tested the speed between machines with a few tools and all
>> confirm
>> > > 10G
>> > > > speed.  I've tried changing various settings of priority and
>> recovery
>> > > sleep
>> > > > hdd, but still the same. Is this a configuration issue or something
>> > else?
>> > > >
>> > > > It's just a small cluster right now with 4 hosts, 11 osd's per.
>> Please
>> > > let
>> > > > me know if you need more information.
>> > > >
>> > > > Thanks,
>> > > > Curt
>> > > > ___
>> > > > ceph-users mailing list -- ceph-users@ceph.io
>> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
>> > >
>> > >
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> >
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-06-27 Thread Robert Gallop
I saw a major boost after having the sleep_hdd set to 0.  Only after that
did I start staying at around 500MiB to 1.2GiB/sec and 1.5k obj/sec to 2.5k
obj/sec.

Eventually it tapered back down, but for me sleep was the key, and
specifically in my case:

osd_recovery_sleep_hdd
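
For anyone wanting to try the same thing, the sleep can be changed and later
reverted at runtime; a sketch (osd.0 is a placeholder, and a value of 0
removes the throttle entirely, so watch client latency while it is in effect):

ceph config set osd osd_recovery_sleep_hdd 0
ceph config show osd.0 | grep osd_recovery_sleep_hdd   # confirm the override took effect
ceph config rm osd osd_recovery_sleep_hdd              # revert to the default later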

On Mon, Jun 27, 2022 at 11:17 AM Curt  wrote:

> On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder  wrote:
>
> > I think this is just how ceph is. Maybe you should post the output of
> > "ceph status", "ceph osd pool stats" and "ceph df" so that we can get an
> > idea whether what you look at is expected or not. As I wrote before,
> object
> > recovery is throttled and the recovery bandwidth depends heavily on
> object
> > size. The interesting question is, how many objects per second are
> > recovered/rebalanced
> >
>  data:
> pools:   11 pools, 369 pgs
> objects: 2.45M objects, 9.2 TiB
> usage:   20 TiB used, 60 TiB / 80 TiB avail
> pgs: 512136/9729081 objects misplaced (5.264%)
>  343 active+clean
>  22  active+remapped+backfilling
>
>   io:
> client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
> recovery: 34 MiB/s, 8 objects/s
>
> Pool 12 is the only one with any stats.
>
> pool EC-22-Pool id 12
>   510048/9545052 objects misplaced (5.344%)
>   recovery io 36 MiB/s, 9 objects/s
>   client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr
>
> --- RAW STORAGE ---
> CLASSSIZE   AVAILUSED  RAW USED  %RAW USED
> hdd80 TiB  60 TiB  20 TiB20 TiB  25.45
> TOTAL  80 TiB  60 TiB  20 TiB20 TiB  25.45
>
> --- POOLS ---
> POOLID  PGS   STORED  OBJECTS USED  %USED  MAX
> AVAIL
> .mgr 11  152 MiB   38  457 MiB  0
>  9.2 TiB
> 21BadPool3   328 KiB1   12 KiB  0
> 18 TiB
> .rgw.root4   32  1.3 KiB4   48 KiB  0
>  9.2 TiB
> default.rgw.log  5   32  3.6 KiB  209  408 KiB  0
>  9.2 TiB
> default.rgw.control  6   32  0 B8  0 B  0
>  9.2 TiB
> default.rgw.meta 78  6.7 KiB   20  203 KiB  0
>  9.2 TiB
> rbd_rep_pool 8   32  2.0 MiB5  5.9 MiB  0
>  9.2 TiB
> default.rgw.buckets.index98  2.0 MiB   33  5.9 MiB  0
>  9.2 TiB
> default.rgw.buckets.non-ec  10   32  1.4 KiB0  4.3 KiB  0
>  9.2 TiB
> default.rgw.buckets.data11   32  232 GiB   61.02k  697 GiB   2.41
>  9.2 TiB
> EC-22-Pool  12  128  9.8 TiB2.39M   20 TiB  41.55
> 14 TiB
>
>
>
> > Maybe provide the output of the first two commands for
> > osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each (wait a
> bit
> > after setting these and then collect the output). Include the applied
> > values for osd_max_backfills* and osd_recovery_max_active* for one of the
> > OSDs in the pool (ceph config show osd.ID | grep -e osd_max_backfills -e
> > osd_recovery_max_active).
> >
>
> I didn't notice any speed difference with sleep values changed, but I'll
> grab the stats between changes when I have a chance.
>
> ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
> osd_max_backfills1000
>
>
> override  mon[5]
> osd_recovery_max_active  1000
>
>
> override
> osd_recovery_max_active_hdd  1000
>
>
> override  mon[5]
> osd_recovery_max_active_ssd  1000
>
>
> override
>
> >
> > I don't really know if on such a small cluster one can expect more than
> > what you see. It has nothing to do with network speed if you have a 10G
> > line. However, recovery is something completely different from a full
> > link-speed copy.
> >
> > I can tell you that boatloads of tiny objects are a huge pain for
> > recovery, even on SSD. Ceph doesn't raid up sections of disks against
> each
> > other, but object for object. This might be a feature request: that PG
> > space allocation and recovery should follow the model of LVM extends
> > (ideally match with LVM extends) to allow recovery/rebalancing larger
> > chunks of storage in one go, containing parts of a large or many small
> > objects.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > __

[ceph-users] Re: Ceph recovery network speed

2022-06-27 Thread Curt
On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder  wrote:

> I think this is just how ceph is. Maybe you should post the output of
> "ceph status", "ceph osd pool stats" and "ceph df" so that we can get an
> idea whether what you look at is expected or not. As I wrote before, object
> recovery is throttled and the recovery bandwidth depends heavily on object
> size. The interesting question is, how many objects per second are
> recovered/rebalanced
>
 data:
pools:   11 pools, 369 pgs
objects: 2.45M objects, 9.2 TiB
usage:   20 TiB used, 60 TiB / 80 TiB avail
pgs: 512136/9729081 objects misplaced (5.264%)
 343 active+clean
 22  active+remapped+backfilling

  io:
client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
recovery: 34 MiB/s, 8 objects/s

Pool 12 is the only one with any stats.

pool EC-22-Pool id 12
  510048/9545052 objects misplaced (5.344%)
  recovery io 36 MiB/s, 9 objects/s
  client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr

--- RAW STORAGE ---
CLASSSIZE   AVAILUSED  RAW USED  %RAW USED
hdd80 TiB  60 TiB  20 TiB20 TiB  25.45
TOTAL  80 TiB  60 TiB  20 TiB20 TiB  25.45

--- POOLS ---
POOLID  PGS   STORED  OBJECTS USED  %USED  MAX
AVAIL
.mgr 11  152 MiB   38  457 MiB  0
 9.2 TiB
21BadPool3   328 KiB1   12 KiB  0
18 TiB
.rgw.root4   32  1.3 KiB4   48 KiB  0
 9.2 TiB
default.rgw.log  5   32  3.6 KiB  209  408 KiB  0
 9.2 TiB
default.rgw.control  6   32  0 B8  0 B  0
 9.2 TiB
default.rgw.meta 78  6.7 KiB   20  203 KiB  0
 9.2 TiB
rbd_rep_pool 8   32  2.0 MiB5  5.9 MiB  0
 9.2 TiB
default.rgw.buckets.index98  2.0 MiB   33  5.9 MiB  0
 9.2 TiB
default.rgw.buckets.non-ec  10   32  1.4 KiB0  4.3 KiB  0
 9.2 TiB
default.rgw.buckets.data11   32  232 GiB   61.02k  697 GiB   2.41
 9.2 TiB
EC-22-Pool  12  128  9.8 TiB2.39M   20 TiB  41.55
14 TiB



> Maybe provide the output of the first two commands for
> osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each (wait a bit
> after setting these and then collect the output). Include the applied
> values for osd_max_backfills* and osd_recovery_max_active* for one of the
> OSDs in the pool (ceph config show osd.ID | grep -e osd_max_backfills -e
> osd_recovery_max_active).
>

I didn't notice any speed difference with sleep values changed, but I'll
grab the stats between changes when I have a chance.

ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
osd_max_backfills1000


override  mon[5]
osd_recovery_max_active  1000


override
osd_recovery_max_active_hdd  1000


override  mon[5]
osd_recovery_max_active_ssd  1000


override
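
The overrides shown above (1000) effectively remove those throttles. Frank's
suggestion in this thread of raising osd_max_backfills slowly from its default
would look more like this; the values are purely illustrative:

ceph config set osd osd_max_backfills 2
ceph config set osd osd_recovery_max_active_hdd 4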

>
> I don't really know if on such a small cluster one can expect more than
> what you see. It has nothing to do with network speed if you have a 10G
> line. However, recovery is something completely different from a full
> link-speed copy.
>
> I can tell you that boatloads of tiny objects are a huge pain for
> recovery, even on SSD. Ceph doesn't raid up sections of disks against each
> other, but object for object. This might be a feature request: that PG
> space allocation and recovery should follow the model of LVM extends
> (ideally match with LVM extends) to allow recovery/rebalancing larger
> chunks of storage in one go, containing parts of a large or many small
> objects.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____________
> From: Curt 
> Sent: 27 June 2022 17:35:19
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: Ceph recovery network speed
>
> Hello,
>
> I had already increased/changed those variables previously.  I increased
> the pg_num to 128. Which increased the number of PG's backfilling, but
> speed is still only at 30 MiB/s avg and has been backfilling 23 pg for the
> last several hours.  Should I increase it higher than 128?
>
> I'm still trying to figure out if this is just how ceph is or if there is
> a bottleneck somewhere.  Like if I sftp a 10G file between servers it's
> done in a couple min or less.  Am I thinking of this wrong?
>
> Thanks,
> Curt
>
> On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder  wrote:
> Hi Curt,
>
> as far as I understood, a 2+2

[ceph-users] Re: Ceph recovery network speed

2022-06-27 Thread Curt
Hello,

I had already increased/changed those variables previously.  I increased
the pg_num to 128. Which increased the number of PG's backfilling, but
speed is still only at 30 MiB/s avg and has been backfilling 23 pg for the
last several hours.  Should I increase it higher than 128?

I'm still trying to figure out if this is just how ceph is or if there is a
bottleneck somewhere.  Like if I sftp a 10G file between servers it's done
in a couple min or less.  Am I thinking of this wrong?

Thanks,
Curt

On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder  wrote:

> Hi Curt,
>
> as far as I understood, a 2+2 EC pool is recovering, which makes 1 OSD per
> host busy. My experience is, that the algorithm for selecting PGs to
> backfill/recover is not very smart. It could simply be that it doesn't find
> more PGs without violating some of these settings:
>
> osd_max_backfills
> osd_recovery_max_active
>
> I have never observed the second parameter to change anything (try any
> ways). However, the first one has a large impact. You could try increasing
> this slowly until recovery moves faster. Another parameter you might want
> to try is
>
> osd_recovery_sleep_[hdd|ssd]
>
> Be careful as this will impact client IO. I could reduce the sleep for my
> HDDs to 0.05. With your workload pattern, this might be something you can
> tune as well.
>
> Having said that, I think you should increase your PG count on the EC pool
> as soon as the cluster is healthy. You have only about 20 PGs per OSD and
> large PGs will take unnecessarily long to recover. A higher PG count will
> also make it easier for the scheduler to find PGs for recovery/backfill.
> Aim for a number between 100 and 200. Give the pool(s) with most data
> (#objects) the most PGs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ____
> From: Curt 
> Sent: 24 June 2022 19:04
> To: Anthony D'Atri; ceph-users@ceph.io
> Subject: [ceph-users] Re: Ceph recovery network speed
>
> 2 PG's shouldn't take hours to backfill in my opinion.  Just 2TB enterprise
> HD's.
>
> Take this log entry below, 72 minutes and still backfilling undersized?
> Should it be that slow?
>
> pg 12.15 is stuck undersized for 72m, current state
> active+undersized+degraded+remapped+backfilling, last acting
> [34,10,29,NONE]
>
> Thanks,
> Curt
>
>
> On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri 
> wrote:
>
> > Your recovery is slow *because* there are only 2 PGs backfilling.
> >
> > What kind of OSD media are you using?
> >
> > > On Jun 24, 2022, at 09:46, Curt  wrote:
> > >
> > > Hello,
> > >
> > > I'm trying to understand why my recovery is so slow with only 2 pg
> > > backfilling.  I'm only getting speeds of 3-4/MiB/s on a 10G network.  I
> > > have tested the speed between machines with a few tools and all confirm
> > 10G
> > > speed.  I've tried changing various settings of priority and recovery
> > sleep
> > > hdd, but still the same. Is this a configuration issue or something
> else?
> > >
> > > It's just a small cluster right now with 4 hosts, 11 osd's per.  Please
> > let
> > > me know if you need more information.
> > >
> > > Thanks,
> > > Curt
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
On Sat, Jun 25, 2022 at 3:27 AM Anthony D'Atri 
wrote:

> The pg_autoscaler aims IMHO way too low and I advise turning it off.
>
>
>
> > On Jun 24, 2022, at 11:11 AM, Curt  wrote:
> >
> >> You wrote 2TB before, are they 2TB or 18TB?  Is that 273 PGs total or
> per
> > osd?
> > Sorry, 18TB of data and 273 PGs total.
> >
> >> `ceph osd df` will show you toward the right how many PGs are on each
> > OSD.  If you have multiple pools, some PGs will have more data than
> others.
> >> So take an average # of PGs per OSD and divide the actual HDD capacity
> > by that.
> > 20 pg on avg / 2TB(technically 1.8 I guess) which would be 10.
>
> I’m confused.  Is 20 what `ceph osd df` is reporting?  Send me the output
> of

Yes, 20 would be the avg pg count.
 ID  CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA  OMAP META
  AVAIL%USE   VAR   PGS  STATUS
 1hdd  1.81940   1.0  1.8 TiB  748 GiB   746 GiB  207 KiB  1.7 GiB
 1.1 TiB  40.16  1.68   21  up
 3hdd  1.81940   1.0  1.8 TiB  459 GiB   457 GiB3 KiB  1.2 GiB
 1.4 TiB  24.61  1.03   20  up
 5hdd  1.81940   1.0  1.8 TiB  153 GiB   152 GiB   32 KiB  472 MiB
 1.7 TiB   8.20  0.34   15  up
 7hdd  1.81940   1.0  1.8 TiB  471 GiB   470 GiB   83 KiB  1.0 GiB
 1.4 TiB  25.27  1.06   24  up
 9hdd  1.81940   1.0  1.8 TiB  1.0 TiB  1022 GiB  136 KiB  2.4 GiB
 838 GiB  54.99  2.30   19  up
11hdd  1.81940   1.0  1.8 TiB  443 GiB   441 GiB4 KiB  1.1 GiB
 1.4 TiB  23.76  0.99   20  up
13hdd  1.81940   1.0  1.8 TiB  438 GiB   437 GiB  310 KiB  1.0 GiB
 1.4 TiB  23.50  0.98   18  up
15hdd  1.81940   1.0  1.8 TiB  334 GiB   333 GiB  621 KiB  929 MiB
 1.5 TiB  17.92  0.75   15  up
17hdd  1.81940   1.0  1.8 TiB  310 GiB   309 GiB2 KiB  807 MiB
 1.5 TiB  16.64  0.70   20  up
19hdd  1.81940   1.0  1.8 TiB  433 GiB   432 GiB7 KiB  974 MiB
 1.4 TiB  23.23  0.97   25  up
45hdd  1.81940   1.0  1.8 TiB  169 GiB   169 GiB2 KiB  615 MiB
 1.7 TiB   9.09  0.38   18  up
 0hdd  1.81940   1.0  1.8 TiB  582 GiB   580 GiB  295 KiB  1.7 GiB
 1.3 TiB  31.24  1.31   21  up
 2hdd  1.81940   1.0  1.8 TiB  870 MiB21 MiB  112 KiB  849 MiB
 1.8 TiB   0.05  0.00   14  up
 4hdd  1.81940   1.0  1.8 TiB  326 GiB   325 GiB   14 KiB  947 MiB
 1.5 TiB  17.48  0.73   24  up
 6hdd  1.81940   1.0  1.8 TiB  450 GiB   448 GiB1 KiB  1.4 GiB
 1.4 TiB  24.13  1.01   17  up
 8hdd  1.81940   1.0  1.8 TiB  152 GiB   152 GiB  618 KiB  900 MiB
 1.7 TiB   8.18  0.34   20  up
10hdd  1.81940   1.0  1.8 TiB  609 GiB   607 GiB4 KiB  1.7 GiB
 1.2 TiB  32.67  1.37   25  up
12hdd  1.81940   1.0  1.8 TiB  333 GiB   332 GiB  175 KiB  1.5 GiB
 1.5 TiB  17.89  0.75   24  up
14hdd  1.81940   1.0  1.8 TiB  1.0 TiB   1.0 TiB1 KiB  2.2 GiB
 834 GiB  55.24  2.31   17  up
16hdd  1.81940   1.0  1.8 TiB  168 GiB   167 GiB4 KiB  1.2 GiB
 1.7 TiB   9.03  0.38   15  up
18hdd  1.81940   1.0  1.8 TiB  299 GiB   298 GiB  261 KiB  1.6 GiB
 1.5 TiB  16.07  0.67   15  up
32hdd  1.81940   1.0  1.8 TiB  873 GiB   871 GiB   45 KiB  2.3 GiB
 990 GiB  46.88  1.96   18  up
22hdd  1.81940   1.0  1.8 TiB  449 GiB   447 GiB  139 KiB  1.6 GiB
 1.4 TiB  24.10  1.01   22  up
23hdd  1.81940   1.0  1.8 TiB  299 GiB   298 GiB5 KiB  1.6 GiB
 1.5 TiB  16.06  0.67   20  up
24hdd  1.81940   1.0  1.8 TiB  887 GiB   885 GiB8 KiB  2.4 GiB
 976 GiB  47.62  1.99   23  up
25hdd  1.81940   1.0  1.8 TiB  451 GiB   449 GiB4 KiB  1.6 GiB
 1.4 TiB  24.20  1.01   17  up
26hdd  1.81940   1.0  1.8 TiB  602 GiB   600 GiB  373 KiB  2.0 GiB
 1.2 TiB  32.29  1.35   21  up
27hdd  1.81940   1.0  1.8 TiB  152 GiB   151 GiB  1.5 MiB  564 MiB
 1.7 TiB   8.14  0.34   14  up
28hdd  1.81940   1.0  1.8 TiB  330 GiB   328 GiB7 KiB  1.6 GiB
 1.5 TiB  17.70  0.74   12  up
29hdd  1.81940   1.0  1.8 TiB  726 GiB   723 GiB7 KiB  2.1 GiB
 1.1 TiB  38.94  1.63   16  up
30hdd  1.81940   1.0  1.8 TiB  596 GiB   594 GiB  173 KiB  2.0 GiB
 1.2 TiB  32.01  1.34   19  up
31hdd  1.81940   1.0  1.8 TiB  304 GiB   303 GiB4 KiB  1.6 GiB
 1.5 TiB  16.34  0.68   20  up
44hdd  1.81940   1.0  1.8 TiB  150 GiB   149 GiB  0 B  599 MiB
 1.7 TiB   8.03  0.34   12  up
33hdd  1.81940   1.0  1.8 TiB  451 GiB   449 GiB  462 KiB  1.8 GiB
 1.4 TiB  24.22  1.01   19  up
34hdd  1.81940   1.0  1.8 TiB  449 GiB   448 GiB2 KiB  966 MiB
 1.4 TiB  24.12  1.01   21  up
35hdd  1.81940   1.0  1.8 TiB  458 GiB   457 GiB2 KiB  1.5 GiB
 1.4 TiB  24.60  1.03   23  up
36hdd  1.81940   1.0  1.8 TiB  872 GiB   870 GiB3 KiB  2.4 GiB
 991 GiB  46.81  1.96   22  up
37hdd  1.81940   1.0  1.8 TiB  443 GiB   441 GiB  136 KiB  

[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
Nope, majority of read/writes happen at night so it's doing less than 1
MiB/s client io right now, sometimes 0.

On Fri, Jun 24, 2022, 22:23 Stefan Kooman  wrote:

> On 6/24/22 20:09, Curt wrote:
> >
> >
> > On Fri, Jun 24, 2022 at 10:00 PM Stefan Kooman  wrote:
> >
> > On 6/24/22 19:49, Curt wrote:
> >  > Pool 12 is my erasure coding pool, 2+2.  How can I tell if it's
> >  > objects or keys recovering?
> >
> > ceph -s will tell you what type of recovery is going on.
> >
> > Is it a cephfs metadata pool? Or a rgw index pool?
> >
> > Gr. Stefan
> >
> >
> > object recovery, I guess I'm used to it always showing object, so didn't
> > know it could be key.
> >
> > rbd pool.
>
> recovery has lower priority than client IO. Is the cluster busy?
>
> Gr. Stefan
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
> You wrote 2TB before, are they 2TB or 18TB?  Is that 273 PGs total or per
osd?
Sorry, 18TB of data and 273 PGs total.

> `ceph osd df` will show you toward the right how many PGs are on each
OSD.  If you have multiple pools, some PGs will have more data than others.
>  So take an average # of PGs per OSD and divide the actual HDD capacity
by that.
20 PGs on avg / 2 TB (technically 1.8, I guess), which would be 10.  Shouldn't
that be used space though, not capacity? My usage is only 23% of capacity.  I
thought Ceph autoscaling changed the PG size dynamically according to
usage?  I'm guessing I'm misunderstanding that part?
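
(For reference, the division Anthony describes goes the other way: per-OSD
capacity divided by PGs per OSD, i.e. roughly 1.8 TiB / 20 PGs ~ 90 GiB of
potential data per PG, or around 20-25 GiB of actual data at the cluster's
current ~25% utilisation. The autoscaler changes pg_num, not the "size" of a
PG; each PG simply holds whatever share of the pool's data maps to it.)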

Thanks,
Curt

On Fri, Jun 24, 2022 at 9:48 PM Anthony D'Atri 
wrote:

>
> > Yes, SATA, I think my benchmark put it around 125, but that was a year
> ago, so could be misremembering
>
> A FIO benchmark, especially a sequential one on an empty drive, can
> mislead as to the real-world performance one sees on a fragmented drive.
>
> >  273 pg at 18TB so each PG would be 60G.
>
> You wrote 2TB before, are they 2TB or 18TB?  Is that 273 PGs total or per
> osd?
>
> >  Mainly used for RBD, using erasure coding.  cephadm bootstrap with
> docker images.
>
> Ack.  Have to account for replication.
>
> `ceph osd df` will show you toward the right how many PGs are on each
> OSD.  If you have multiple pools, some PGs will have more data than others.
>
> So take an average # of PGs per OSD and divide the actual HDD capacity by
> that.
>
>
>
>
> >
> > On Fri, Jun 24, 2022 at 9:21 PM Anthony D'Atri 
> wrote:
> >
> >
> > >
> > > 2 PG's shouldn't take hours to backfill in my opinion.  Just 2TB
> enterprise HD's.
> >
> > SATA? Figure they can write at 70 MB/s
> >
> > How big are your PGs?  What is your cluster used for?  RBD? RGW? CephFS?
> >
> > >
> > > Take this log entry below, 72 minutes and still backfilling
> undersized?  Should it be that slow?
> > >
> > > pg 12.15 is stuck undersized for 72m, current state
> active+undersized+degraded+remapped+backfilling, last acting [34,10,29,NONE]
> > >
> > > Thanks,
> > > Curt
> > >
> > >
> > > On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri <
> anthony.da...@gmail.com> wrote:
> > > Your recovery is slow *because* there are only 2 PGs backfilling.
> > >
> > > What kind of OSD media are you using?
> > >
> > > > On Jun 24, 2022, at 09:46, Curt  wrote:
> > > >
> > > > Hello,
> > > >
> > > > I'm trying to understand why my recovery is so slow with only 2 pg
> > > > backfilling.  I'm only getting speeds of 3-4/MiB/s on a 10G
> network.  I
> > > > have tested the speed between machines with a few tools and all
> confirm 10G
> > > > speed.  I've tried changing various settings of priority and
> recovery sleep
> > > > hdd, but still the same. Is this a configuration issue or something
> else?
> > > >
> > > > It's just a small cluster right now with 4 hosts, 11 osd's per.
> Please let
> > > > me know if you need more information.
> > > >
> > > > Thanks,
> > > > Curt
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> >
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
On Fri, Jun 24, 2022 at 10:00 PM Stefan Kooman  wrote:

> On 6/24/22 19:49, Curt wrote:
> > Pool 12 is my erasure coding pool, 2+2.  How can I tell if it's
> > objects or keys recovering?
>
> ceph -s will tell you what type of recovery is going on.
>
> Is it a cephfs metadata pool? Or a rgw index pool?
>
> Gr. Stefan
>

object recovery, I guess I'm used to it always showing object, so didn't
know it could be key.

rbd pool.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
Pool 12 is my erasure coding pool, 2+2.  How can I tell if it's objects
or keys recovering?

Thanks,
Curt

On Fri, Jun 24, 2022 at 9:39 PM Stefan Kooman  wrote:

> On 6/24/22 19:04, Curt wrote:
> > 2 PG's shouldn't take hours to backfill in my opinion.  Just 2TB
> enterprise
> > HD's.
> >
> > Take this log entry below, 72 minutes and still backfilling undersized?
> > Should it be that slow?
> >
> > pg 12.15 is stuck undersized for 72m, current state
> > active+undersized+degraded+remapped+backfilling, last acting
> [34,10,29,NONE]
>
> What is in that pool 12? Is it objects that are recovering, or keys?
> OMAP data (keys) is slow.
>
> Gr. Stefan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph recovery network speed

2022-06-24 Thread Curt
2 PGs shouldn't take hours to backfill, in my opinion.  These are just 2 TB
enterprise HDDs.

Take this log entry below, 72 minutes and still backfilling undersized?
Should it be that slow?

pg 12.15 is stuck undersized for 72m, current state
active+undersized+degraded+remapped+backfilling, last acting [34,10,29,NONE]

Thanks,
Curt


On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri 
wrote:

> Your recovery is slow *because* there are only 2 PGs backfilling.
>
> What kind of OSD media are you using?
>
> > On Jun 24, 2022, at 09:46, Curt  wrote:
> >
> > Hello,
> >
> > I'm trying to understand why my recovery is so slow with only 2 pg
> > backfilling.  I'm only getting speeds of 3-4/MiB/s on a 10G network.  I
> > have tested the speed between machines with a few tools and all confirm
> 10G
> > speed.  I've tried changing various settings of priority and recovery
> sleep
> > hdd, but still the same. Is this a configuration issue or something else?
> >
> > It's just a small cluster right now with 4 hosts, 11 osd's per.  Please
> let
> > me know if you need more information.
> >
> > Thanks,
> > Curt
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph] Recovery is very Slow

2021-10-28 Thread Christian Wuerdig
Yes, just expose each disk as an individual OSD and you'll already be
better off. Depending on what type of SSD they are - if they can sustain
high random-write IOPS you may even want to consider partitioning each
disk and creating 2 OSDs per SSD to make better use of the available IO
capacity.
For all-flash storage, CPU utilization is also a factor - generally,
fewer cores with a higher clock speed would be preferred over a CPU
with more cores but lower clock speeds in such a setup.
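
As a rough illustration of the two-OSDs-per-device idea, ceph-volume can split
drives automatically; a sketch only, with placeholder device names, and note
that it wipes the listed devices:

ceph-volume lvm batch --osds-per-device 2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf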


On Thu, 28 Oct 2021 at 21:25, Lokendra Rathour
 wrote:
>
> Hey Janne,
> Thanks for the feedback; we only wanted to have a large amount of space to
> test with more data. Do you advise some other way to plan this out?
> So I have 15 disks with 1 TB each.  Would creating multiple OSDs help?
> Please advise.
>
> thanks,
> Lokendra
>
>
> On Thu, Oct 28, 2021 at 1:52 PM Janne Johansson  wrote:
>>
>> On Thu, 28 Oct 2021 at 10:18, Lokendra Rathour
>> wrote:
>> >
>> > Hi Christian,
>> > Thanks for the update.
>> > I have 5 SSDs on each node, i.e. a total of 15 SSDs, from which I have
>> > created RAID 0 disks, which in Ceph become three OSDs. Each OSD has
>> > around 4.4 TB of disk, and in total it comes to around 13.3 TB.
>> > Do you feel local RAID is an issue here? Would keeping independent disks
>> > help recovery speed or increase performance? Please advise.
>>
>>
>> That is a very poor way to set up ceph storage.
>>
>>
>> --
>> May the most significant bit of your life be positive.
>
>
>
> --
> ~ Lokendra
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io