[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Joshua Baergen
Now that you're on wpq, you can try tweaking osd_max_backfills (up)
and osd_recovery_sleep (down).
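
For example, something along these lines (the values are only illustrative
starting points, not recommendations -- tune for your hardware):

ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_sleep_hdd 0
ceph config set osd osd_recovery_sleep_hybrid 0

Under wpq these take effect at runtime, no OSD restart needed.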

Josh

On Fri, May 24, 2024 at 1:07 PM Mazzystr  wrote:
>
> I did the obnoxious task of updating ceph.conf and restarting all my osds.
>
> ceph --admin-daemon /var/run/ceph/ceph-osd.*.asok config get osd_op_queue
> {
> "osd_op_queue": "wpq"
> }
>
> I have some spare memory on my target host/OSD and increased the target
> memory of that OSD to 10 GB and restarted.  No effect observed.  In fact, memory
> usage on the host is stable, so I don't think the change took effect even after
> updating ceph.conf, a restart, and a direct asok config set.  The target memory
> value is confirmed to be set via asok config get.
>
> Nothing has helped.  I still cannot break the 21 MiB/s barrier.
>
> Does anyone have any more ideas?
>
> /C
>
> On Fri, May 24, 2024 at 10:20 AM Joshua Baergen  
> wrote:
>>
>> It requires an OSD restart, unfortunately.
>>
>> Josh
>>
>> On Fri, May 24, 2024 at 11:03 AM Mazzystr  wrote:
>> >
>> > Is that a setting that can be applied runtime or does it req osd restart?
>> >
>> > On Fri, May 24, 2024 at 9:59 AM Joshua Baergen 
>> > wrote:
>> >
>> > > Hey Chris,
>> > >
>> > > A number of users have been reporting issues with recovery on Reef
>> > > with mClock. Most folks have had success reverting to
>> > > osd_op_queue=wpq. AIUI 18.2.3 should have some mClock improvements but
>> > > I haven't looked at the list myself yet.
>> > >
>> > > Josh
>> > >
>> > > On Fri, May 24, 2024 at 10:55 AM Mazzystr  wrote:
>> > > >
>> > > > Hi all,
>> > > > Goodness I'd say it's been at least 3 major releases since I had to do a
>> > > > recovery.  I have disks with 60-75,000 power_on_hours.  I just updated
>> > > > from Octopus to Reef last month and I'm hit with 3 disk failures and the
>> > > > mClock ugliness.  My recovery is moving at a wondrous 21 MiB/s after some
>> > > > serious hacking.  It started out at 9 MiB/s.
>> > > >
>> > > > My hosts are showing minimal CPU use, normal memory use, and 0-6% disk
>> > > > busyness.  Load is minimal, so processes aren't blocked by disk I/O.
>> > > >
>> > > > I tried changing all the sleeps and recovery_max, and setting
>> > > > osd_mclock_profile to high_recovery_ops, with no change in performance.
>> > > >
>> > > > Does anyone have any suggestions to improve performance?
>> > > >
>> > > > Thanks,
>> > > > /Chris C
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Joshua Baergen
It requires an OSD restart, unfortunately.

Josh

On Fri, May 24, 2024 at 11:03 AM Mazzystr  wrote:
>
> Is that a setting that can be applied runtime or does it req osd restart?
>
> On Fri, May 24, 2024 at 9:59 AM Joshua Baergen 
> wrote:
>
> > Hey Chris,
> >
> > A number of users have been reporting issues with recovery on Reef
> > with mClock. Most folks have had success reverting to
> > osd_op_queue=wpq. AIUI 18.2.3 should have some mClock improvements but
> > I haven't looked at the list myself yet.
> >
> > Josh
> >
> > On Fri, May 24, 2024 at 10:55 AM Mazzystr  wrote:
> > >
> > > Hi all,
> > > Goodness I'd say it's been at least 3 major releases since I had to do a
> > > recovery.  I have disks with 60-75,000 power_on_hours.  I just updated
> > > from Octopus to Reef last month and I'm hit with 3 disk failures and the
> > > mClock ugliness.  My recovery is moving at a wondrous 21 MiB/s after some
> > > serious hacking.  It started out at 9 MiB/s.
> > >
> > > My hosts are showing minimal CPU use, normal memory use, and 0-6% disk
> > > busyness.  Load is minimal, so processes aren't blocked by disk I/O.
> > >
> > > I tried changing all the sleeps and recovery_max, and setting
> > > osd_mclock_profile to high_recovery_ops, with no change in performance.
> > >
> > > Does anyone have any suggestions to improve performance?
> > >
> > > Thanks,
> > > /Chris C
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Joshua Baergen
Hey Chris,

A number of users have been reporting issues with recovery on Reef
with mClock. Most folks have had success reverting to
osd_op_queue=wpq. AIUI 18.2.3 should have some mClock improvements but
I haven't looked at the list myself yet.
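
For reference, a minimal sketch of the revert, assuming the centralized
config database is in use (updating ceph.conf and restarting, as done
elsewhere in this thread, works too):

ceph config set osd osd_op_queue wpq
# then restart the OSDs, e.g. one failure domain at a time;
# on non-cephadm deployments: systemctl restart ceph-osd@<ID>

ceph daemon osd.<ID> config get osd_op_queue   # should report "wpq" afterwards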

Josh

On Fri, May 24, 2024 at 10:55 AM Mazzystr  wrote:
>
> Hi all,
> Goodness I'd say it's been at least 3 major releases since I had to do a
> recovery.  I have disks with 60-75,000 power_on_hours.  I just updated from
> Octopus to Reef last month and I'm hit with 3 disk failures and the mClock
> ugliness.  My recovery is moving at a wondrous 21 MiB/s after some serious
> hacking.  It started out at 9 MiB/s.
>
> My hosts are showing minimal CPU use, normal memory use, and 0-6% disk
> busyness.  Load is minimal, so processes aren't blocked by disk I/O.
>
> I tried changing all the sleeps and recovery_max, and setting
> osd_mclock_profile to high_recovery_ops, with no change in performance.
>
> Does anyone have any suggestions to improve performance?
>
> Thanks,
> /Chris C
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops during recovery for RGW index pool only when degraded OSD is primary

2024-04-03 Thread Joshua Baergen
Hey Anthony,

Like with many other options in Ceph, I think what's missing is the
user-visible effect of what's being altered. I believe the reason why
synchronous recovery is still used is that, assuming that per-object
recovery is quick, it's faster to complete than asynchronous recovery,
which has extra steps on either end of the recovery process. Of
course, as you know, synchronous recovery blocks I/O, so when
per-object recovery isn't quick, as in RGW index omap shards,
particularly large shards, IMO we're better off always doing async
recovery.

I don't know enough about the overheads involved here to evaluate
whether it's worth keeping synchronous recovery at all, but IMO RGW
index/usage(/log/gc?) pools are always better off using asynchronous
recovery.
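
If one wanted to limit that change to just the OSDs backing such pools, a
sketch using the config mask syntax (this assumes the index pool sits on a
dedicated "nvme" device class -- adjust the mask to your topology):

ceph config set osd/class:nvme osd_async_recovery_min_cost 0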

Josh

On Wed, Apr 3, 2024 at 1:48 PM Anthony D'Atri  wrote:
>
> We currently have in  src/common/options/global.yaml.in
>
> - name: osd_async_recovery_min_cost
>   type: uint
>   level: advanced
>   desc: A mixture measure of number of current log entries difference and 
> historical
> missing objects,  above which we switch to use asynchronous recovery when 
> appropriate
>   default: 100
>   flags:
>   - runtime
>
> I'd like to rephrase the description there in a PR; might you be able to
> share your insight into the dynamics so I can craft a better description?  
> And do you have any thoughts on the default value?  Might appropriate values 
> vary by pool type and/or media?
>
>
>
> > On Apr 3, 2024, at 13:38, Joshua Baergen  wrote:
> >
> > We've had success using osd_async_recovery_min_cost=0 to drastically
> > reduce slow ops during index recovery.
> >
> > Josh
> >
> > On Wed, Apr 3, 2024 at 11:29 AM Wesley Dillingham  
> > wrote:
> >>
> >> I am fighting an issue on an 18.2.0 cluster where a restart of an OSD which
> >> supports the RGW index pool causes crippling slow ops. If the OSD is marked
> >> with primary-affinity of 0 prior to the OSD restart no slow ops are
> >> observed. If the OSD has a primary affinity of 1 slow ops occur. The slow
> >> ops only occur during the recovery period of the OMAP data and further only
> >> occur when client activity is allowed to pass to the cluster. Luckily I am
> >> able to test this during periods when I can disable all client activity at
> >> the upstream proxy.
> >>
> >> Given the behavior of the primary affinity changes preventing the slow ops,
> >> I think this may be a case of recovery being more detrimental than
> >> backfill. I am thinking that causing a pg_temp acting set by forcing
> >> backfill may be the right method to mitigate the issue. [1]
> >>
> >> I believe that reducing the PG log entries for these OSDs would accomplish
> >> that but I am also thinking a tuning of osd_async_recovery_min_cost [2] may
> >> also accomplish something similar. Not sure the appropriate tuning for that
> >> config at this point or if there may be a better approach. Seeking any
> >> input here.
> >>
> >> Further if this issue sounds familiar or sounds like another condition
> >> within the OSD may be at hand I would be interested in hearing your input
> >> or thoughts. Thanks!
> >>
> >> [1] https://docs.ceph.com/en/latest/dev/peering/#concepts
> >> [2]
> >> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_async_recovery_min_cost
> >>
> >> Respectfully,
> >>
> >> *Wes Dillingham*
> >> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
> >> w...@wesdillingham.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops during recovery for RGW index pool only when degraded OSD is primary

2024-04-03 Thread Joshua Baergen
We've had success using osd_async_recovery_min_cost=0 to drastically
reduce slow ops during index recovery.
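
The option is flagged as runtime-changeable, so no restart should be needed;
assuming the centralized config database, something like:

ceph config set osd osd_async_recovery_min_cost 0
ceph config get osd osd_async_recovery_min_cost   # verify the new value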

Josh

On Wed, Apr 3, 2024 at 11:29 AM Wesley Dillingham  
wrote:
>
> I am fighting an issue on an 18.2.0 cluster where a restart of an OSD which
> supports the RGW index pool causes crippling slow ops. If the OSD is marked
> with primary-affinity of 0 prior to the OSD restart no slow ops are
> observed. If the OSD has a primary affinity of 1 slow ops occur. The slow
> ops only occur during the recovery period of the OMAP data and further only
> occur when client activity is allowed to pass to the cluster. Luckily I am
> able to test this during periods when I can disable all client activity at
> the upstream proxy.
>
> Given the behavior of the primary affinity changes preventing the slow ops,
> I think this may be a case of recovery being more detrimental than
> backfill. I am thinking that causing a pg_temp acting set by forcing
> backfill may be the right method to mitigate the issue. [1]
>
> I believe that reducing the PG log entries for these OSDs would accomplish
> that but I am also thinking a tuning of osd_async_recovery_min_cost [2] may
> also accomplish something similar. Not sure the appropriate tuning for that
> config at this point or if there may be a better approach. Seeking any
> input here.
>
> Further if this issue sounds familiar or sounds like another condition
> within the OSD may be at hand I would be interested in hearing your input
> or thoughts. Thanks!
>
> [1] https://docs.ceph.com/en/latest/dev/peering/#concepts
> [2]
> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_async_recovery_min_cost
>
> Respectfully,
>
> *Wes Dillingham*
> LinkedIn 
> w...@wesdillingham.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: S3 Partial Reads from Erasure Pool

2024-04-01 Thread Joshua Baergen
I think it depends on what you mean by rados objects and s3 objects here. If
you're talking about an object that was uploaded via MPU, and thus may
comprise many rados objects, I don't think there's a difference in read
behaviors based on pool type. If you're talking about reading a subset byte
range from a single rados object stored on an EC pool, yes, the whole
object is read from the pool in order to serve that subset read, something
that https://github.com/ceph/ceph/pull/55196 endeavours to address.
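
If it's unclear which case applies, one way to check (hedged -- exact output
varies by version) is to inspect the object's manifest on the RGW side:

radosgw-admin object stat --bucket=<bucket> --object=<key>

A multipart upload shows multiple parts in the manifest, whereas a small
single-part object maps to a single head rados object.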

Josh

On Mon, Mar 25, 2024, 4:27 p.m.  wrote:

> I am dealing with a cluster that is having terrible performance with
> partial reads from an erasure coded pool. Warp tests and s3bench tests
> result in acceptable performance but when the application hits the data,
> performance plummets. Can anyone clear this up for me: when radosgw gets a
> partial read, does it have to assemble all the rados objects that make up
> the s3 object before returning the range? With a replicated pool I am
> seeing 6 to 7 GiB/s of read performance and only 1 GiB/s of read from the
> erasure coded pool, which leads me to believe that the replicated pool is
> returning just the rados objects for the partial s3 object and the erasure
> coded pool is not.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: log_latency slow operation observed for submit_transact, latency = 22.644258499s

2024-03-22 Thread Joshua Baergen
Personally, I don't think the compaction is actually required. Reef
has compact-on-iteration enabled, which should take care of this
automatically. We see this sort of delay pretty often during PG
cleaning, at the end of a PG being cleaned, when the PG has a high
count of objects, whether or not OSD compaction has been keeping up
with tombstones. It's unfortunately just something to ride through
these days until backfill completes.

https://github.com/ceph/ceph/pull/49438 is a recent attempt to improve
things in this area, but I'm not sure whether it would eliminate this
issue. We've considered going to higher PG counts (and thus fewer
objects per PG) as a possible mitigation as well.
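
For anyone who does decide to compact anyway, the offline route Igor mentions
further down is roughly the following (the OSD must be stopped first; the path
assumes a traditional, non-cephadm layout):

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<ID> compact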

Josh

On Fri, Mar 22, 2024 at 2:59 AM Alexander E. Patrakov
 wrote:
>
> Hello Torkil,
>
> The easiest way (in my opinion) to perform offline compaction is a bit
> different than what Igor suggested. We had a prior off-list
> conversation indicating that the results would be equivalent.
>
> 1. ceph config set osd osd_compact_on_start true
> 2. Restart the OSD that you want to compact (or the whole host at
> once, if you want to compact the whole host and your failure domain
> allows for that)
> 3. ceph config set osd osd_compact_on_start false
>
> The OSD will restart, but will not show as "up" until the compaction
> process completes. In your case, I would expect it to take up to 40
> minutes.
>
> On Fri, Mar 22, 2024 at 3:46 PM Torkil Svensgaard  wrote:
> >
> >
> > On 22-03-2024 08:38, Igor Fedotov wrote:
> > > Hi Torkil,
> >
> > Hi Igor
> >
> > > Highly likely you're facing a well-known issue with RocksDB performance
> > > drop after bulk data removal. The latter might occur at source OSDs
> > > after PG migration completion.
> >
> > Aha, thanks.
> >
> > > You might want to use DB compaction (preferably an offline one using
> > > ceph-kvstore-tool) to get the OSD out of this "degraded" state, or as a
> > > preventive measure. I'd recommend doing that for all the OSDs right now, and
> > > once again after rebalancing is completed.  This should improve things but
> > > unfortunately there's no 100% guarantee.
> >
> > Why is offline preferred? With offline, the easiest way would be
> > something like stopping all OSDs one host at a time and running a loop
> > over /var/lib/ceph/$id/osd.*?
> >
> > > Also curious if you have DB/WAL on fast (SSD or NVMe) drives? This might
> > > be crucial..
> >
> > We do, 22 HDDs and 2 DB/WAL NVMes pr host.
> >
> > Thanks.
> >
> > Mvh.
> >
> > Torkil
> >
> > >
> > > Thanks,
> > >
> > > Igor
> > >
> > > On 3/22/2024 9:59 AM, Torkil Svensgaard wrote:
> > >> Good morning,
> > >>
> > >> Cephadm Reef 18.2.1. We recently added 4 hosts and changed a failure
> > >> domain from host to datacenter which is the reason for the large
> > >> misplaced percentage.
> > >>
> > >> We were seeing some pretty crazy spikes in "OSD Read Latencies" and
> > >> "OSD Write Latencies" on the dashboard. Most of the time everything is
> > >> well but then for periods of time, 1-4 hours, latencies will go to 10+
> > >> seconds for one or more OSDs. This also happens outside scrub hours
> > >> and it is not the same OSDs every time. The OSDs affected are HDD with
> > >> DB/WAL on NVMe.
> > >>
> > >> Log snippet:
> > >>
> > >> "
> > >> ...
> > >> 2024-03-22T06:48:22.859+ 7fb184b52700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> > >> 2024-03-22T06:48:22.859+ 7fb185b54700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> > >> 2024-03-22T06:48:22.864+ 7fb169898700  1 heartbeat_map
> > >> clear_timeout 'OSD::osd_op_tp thread 0x7fb169898700' had timed out
> > >> after 15.00954s
> > >> 2024-03-22T06:48:22.864+ 7fb169898700  0 bluestore(/var/lib/ceph/
> > >> osd/ceph-112) log_latency slow operation observed for submit_transact,
> > >> latency = 17.716707230s
> > >> 2024-03-22T06:48:22.880+ 7fb1748ae700  0 bluestore(/var/lib/ceph/
> > >> osd/ceph-112) log_latency_fn slow operation observed for
> > >> _txc_committed_kv, latency = 17.732601166s, txc = 0x55a5bcda0f00
> > >> 2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> > >> 2024-03-22T06:48:38.077+ 7fb184b52700  1 heartbeat_map is_healthy
> > >> 'OSD::osd_op_tp thread 0x7fb169898700' had timed out after 15.00954s
> > >> ...
> > >> "
> > >>
> > >> "
> > >> [root@dopey ~]# ceph -s
> > >>   cluster:
> > >> id: 8ee2d228-ed21-4580-8bbf-0649f229e21d
> > >> health: HEALTH_WARN
> > >> 1 failed cephadm daemon(s)
> > >> Low space hindering backfill (add storage if this doesn't
> > >> resolve itself): 1 pg backfill_toofull
> > >>
> > >>   services:
> > >> mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 3d)
> > >> mgr: jolly.tpgixt(active, since 10d), standbys: dopey.lxajvk,
> > >> lazy.xuhetq
> > >> mds: 

[ceph-users] Re: Why a lot of pgs are degraded after host(+osd) restarted?

2024-03-20 Thread Joshua Baergen
Hi Jaemin,

It is normal for PGs to become degraded during a host reboot, since a
copy of the data was taken offline and needs to be resynchronized
after the host comes back. Normally this is quick, as the recovery
mechanism only needs to modify those objects that have changed while
the host is down.

However, if you have backfills ongoing and reboot a host that contains
OSDs involved in those backfills, then those backfills become
degraded, and you will need to wait for them to complete for
degradation to clear. Do you know if you had backfills at the time the
host was rebooted? If so, the way to avoid this is to wait for
backfill to complete before taking any OSDs/hosts down for
maintenance.
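
A quick pre-maintenance check for in-flight backfill might look like this
(sketch, assuming the standard CLI):

ceph status                               # look for backfilling/backfill_wait PGs
ceph pg dump pgs_brief | grep -c backfill # count PGs still involved in backfill

Setting the noout flag (ceph osd set noout) for the duration of the reboot is
also common practice so the down OSDs aren't marked out and rebalanced in the
meantime, though it does not by itself prevent degraded objects.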

Josh

On Wed, Mar 20, 2024 at 1:50 AM Jaemin Joo  wrote:
>
> Hi all,
>
> While I am testing host failover, there are a lot of degraded PGs after the
> host (+OSDs) comes back up. Even though the restart only takes a short time, I
> don't understand why the PGs should check all objects related to the failed
> host (+OSDs). I'd like to know how to prevent PGs from becoming degraded when
> an OSD restarts.
>
> FYI, "degraded PG" here means the
> "active+undersized+degraded+remapped+backfilling" PG state.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs not balanced

2024-03-04 Thread Joshua Baergen
The balancer will operate on all pools unless otherwise specified.
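
For reference, the balancer's current state and scope can be checked with:

ceph balancer status
ceph balancer pool ls   # empty output means "all pools", not "no pools"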

Josh

On Mon, Mar 4, 2024 at 1:12 PM Cedric  wrote:
>
> Does the balancer have any pools enabled? "ceph balancer pool ls"
>
> Actually I am wondering if the balancer does anything when no pools are
> added.
>
>
>
> On Mon, Mar 4, 2024, 11:30 Ml Ml  wrote:
>
> > Hello,
> >
> > I wonder why my autobalancer is not working here:
> >
> > root@ceph01:~# ceph -s
> >   cluster:
> > id: 5436dd5d-83d4-4dc8-a93b-60ab5db145df
> > health: HEALTH_ERR
> > 1 backfillfull osd(s)
> > 1 full osd(s)
> > 1 nearfull osd(s)
> > 4 pool(s) full
> >
> > => osd.17 was too full (92% or something like that)
> >
> > root@ceph01:~# ceph osd df tree
> > ID   CLASS  WEIGHT REWEIGHT  SIZE ... %USE  ... PGS TYPE NAME
> > -25 209.50084 -  213 TiB  ... 69.56 ...   - datacenter
> > xxx-dc-root
> > -19  84.59369 -   86 TiB  ... 56.97 ...   - rack
> > RZ1.Reihe4.R10
> >  -3  35.49313 -   37 TiB  ... 57.88 ...   - host
> > ceph02
> >   2hdd1.7   1.0  1.7 TiB  ... 58.77 ...  44
> >  osd.2
> >   3hdd1.0   1.0  2.7 TiB  ... 22.14 ...  25
> >  osd.3
> >   7hdd2.5   1.0  2.7 TiB  ... 58.84 ...  70
> >  osd.7
> >   9hdd9.5   1.0  9.5 TiB  ... 63.07 ... 268
> >  osd.9
> >  13hdd2.67029   1.0  2.7 TiB  ... 53.59 ...  65
> >  osd.13
> >  16hdd2.8   1.0  2.7 TiB  ... 59.35 ...  71
> >  osd.16
> >  19hdd1.7   1.0  1.7 TiB  ... 48.98 ...  37
> >  osd.19
> >  23hdd2.38419   1.0  2.4 TiB  ... 59.33 ...  64
> >  osd.23
> >  24hdd1.3   1.0  1.7 TiB  ... 51.23 ...  39
> >  osd.24
> >  28hdd3.63869   1.0  3.6 TiB  ... 64.17 ... 104
> >  osd.28
> >  31hdd2.7   1.0  2.7 TiB  ... 64.73 ...  76
> >  osd.31
> >  32hdd3.3   1.0  3.3 TiB  ... 67.28 ... 101
> >  osd.32
> >  -9  22.88817 -   23 TiB  ... 56.96 ...   - host
> > ceph06
> >  35hdd7.15259   1.0  7.2 TiB  ... 55.71 ... 182
> >  osd.35
> >  36hdd5.24519   1.0  5.2 TiB  ... 53.75 ... 128
> >  osd.36
> >  45hdd5.24519   1.0  5.2 TiB  ... 60.91 ... 144
> >  osd.45
> >  48hdd5.24519   1.0  5.2 TiB  ... 57.94 ... 139
> >  osd.48
> > -17  26.21239 -   26 TiB  ... 55.67 ...   - host
> > ceph08
> >  37hdd6.67569   1.0  6.7 TiB  ... 58.17 ... 174
> >  osd.37
> >  40hdd9.53670   1.0  9.5 TiB  ... 58.54 ... 250
> >  osd.40
> >  46hdd5.0   1.0  5.0 TiB  ... 52.39 ... 116
> >  osd.46
> >  47hdd5.0   1.0  5.0 TiB  ... 50.05 ... 112
> >  osd.47
> > -20  59.11053 -   60 TiB  ... 82.47 ...   - rack
> > RZ1.Reihe4.R9
> >  -4  23.09996 -   24 TiB  ... 79.92 ...   - host
> > ceph03
> >   5hdd1.7   0.75006  1.7 TiB  ... 87.24 ...  66
> >  osd.5
> >   6hdd1.7   0.44998  1.7 TiB  ... 47.30 ...  36
> >  osd.6
> >  10hdd2.7   0.85004  2.7 TiB  ... 83.23 ... 100
> >  osd.10
> >  15hdd2.7   0.75006  2.7 TiB  ... 74.26 ...  88
> >  osd.15
> >  17hdd0.5   0.85004  1.6 TiB  ... 91.44 ...  67
> >  osd.17
> >  20hdd2.0   0.85004  1.7 TiB  ... 88.41 ...  68
> >  osd.20
> >  21hdd2.7   0.75006  2.7 TiB  ... 77.25 ...  91
> >  osd.21
> >  25hdd1.7   0.90002  1.7 TiB  ... 78.31 ...  60
> >  osd.25
> >  26hdd2.7   1.0  2.7 TiB  ... 82.75 ...  99
> >  osd.26
> >  27hdd2.7   0.90002  2.7 TiB  ... 84.26 ... 101
> >  osd.27
> >  63hdd1.8   0.90002  1.7 TiB  ... 84.15 ...  65
> >  osd.63
> > -13  36.01057 -   36 TiB  ... 84.12 ...   - host
> > ceph05
> >  11hdd7.15259   0.90002  7.2 TiB  ... 85.45 ... 273
> >  osd.11
> >  39hdd7.2   0.85004  7.2 TiB  ... 80.90 ... 257
> >  osd.39
> >  41hdd7.2   0.75006  7.2 TiB  ... 74.95 ... 239
> >  osd.41
> >  42hdd9.0   1.0  9.5 TiB  ... 92.00 ... 392
> >  osd.42
> >  43hdd5.45799   1.0  5.5 TiB  ... 84.84 ... 207
> >  osd.43
> > -21  65.79662 -   66 TiB  ... 74.29 ...   - rack
> > RZ3.Reihe3.R10
> >  -2  28.49664 -   29 TiB  ... 74.79 ...   - host
> > ceph01
> >   0hdd2.7   1.0  2.7 TiB  ... 73.82 ...  88
> >  osd.0
> >   1hdd3.63869   1.0  3.6 TiB  ... 73.47 ... 121
> >  osd.1
> >   4hdd2.7   1.0  2.7 TiB  ... 74.63 ...  89
> >  osd.4
> >   8hdd2.7   1.0  2.7 TiB  ... 77.10 ...  92
> >  osd.8
> >  12hdd2.7   1.0  2.7 TiB  ... 78.76 ...  94
> >  osd.12
> >  14hdd5.45799   1.0  5.5 TiB  ... 78.86 ... 193
> >  osd.14
> >  18hdd1.8   1.0  2.7 TiB  ... 63.79 ...  76
> >  osd.18
> >  22hdd

[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-02 Thread Joshua Baergen
Periodic discard was actually attempted in the past:
https://github.com/ceph/ceph/pull/20723

A proper implementation would probably need appropriate
scheduling/throttling that can be tuned so as to balance against
client I/O impact.
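
For anyone experimenting with this today, a hedged sketch of enabling it only
for a specific device class via the centralized config, along the lines Matt
describes below (the "nvme" class name is just an example, and the OSDs may
need a restart for the bdev options to take effect):

ceph config set osd/class:nvme bdev_enable_discard true
ceph config set osd/class:nvme bdev_async_discard true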

Josh

On Sat, Mar 2, 2024 at 6:20 AM David C.  wrote:
>
> Could we not consider setting up a “bluefstrim” which could be orchestrated?
>
> This would avoid having a continuous stream of (D)iscard instructions on
> the disks during activity.
>
> A weekly (probably monthly) bluefstrim could probably be enough for
> platforms that really need it.
>
>
> Le sam. 2 mars 2024 à 12:58, Matt Vandermeulen  a
> écrit :
>
> > We've had a specific set of drives that we've had to enable
> > bdev_enable_discard and bdev_async_discard for in order to maintain
> > acceptable performance on block clusters. I wrote the patch that Igor
> > mentioned in order to try and send more parallel discards to the
> > devices, but these ones in particular seem to process them in serial
> > (based on observed discard counts and latency going to the device),
> > which is unfortunate. We're also testing new firmware that suggests it
> > should help alleviate some of the initial concerns we had about discards
> > not keeping up which prompted the patch in the first place.
> >
> > Most of our drives do not need discards enabled (and definitely not
> > without async) in order to maintain performance unless we're doing a
> > full disk fio test or something like that where we're trying to find its
> > cliff profile. We've used OSD classes to help target the options being
> > applied to specific OSDs via centralized conf which helps when we would
> > add new hosts that may have different drives so that the options weren't
> > applied globally.
> >
> > Based on our experience, I wouldn't enable it unless you're seeing some
> > sort of cliff-like behaviour as your OSDs run low on free space, or are
> > heavily fragmented. I would also deem bdev_async_discard = 1 to be a
> > requirement so that it doesn't block user IO. Keep an eye on your
> > discards being sent to devices and the discard latency, as well (via
> > node_exporter, for example).
> >
> > Matt
> >
> >
> > On 2024-03-02 06:18, David C. wrote:
> > > I came across an enterprise NVMe used for BlueFS DB whose performance
> > > dropped sharply after a few months of delivery (I won't mention the
> > > brand
> > > here but it was not among these 3: Intel, Samsung, Micron).
> > > It is clear that enabling bdev_enable_discard impacted performance, but
> > > this option also saved the platform after a few days of discard.
> > >
> > > IMHO the most important thing is to validate the behavior when there
> > > has
> > > been a write to the entire flash media.
> > > But this option has the merit of existing.
> > >
> > > It seems to me that the ideal would be not to have several options on
> > > bdev_*discard, and that this task should be asynchronous, with the
> > > (D)iscard instructions issued during a calmer period of activity (I do not
> > > see any impact if the instructions are lost during an OSD reboot).
> > >
> > >
> > > Le ven. 1 mars 2024 à 19:17, Igor Fedotov  a
> > > écrit :
> > >
> > >> I played with this feature a while ago and recall it had a visible
> > >> negative impact on user operations due to the need to submit tons of
> > >> discard operations - effectively each data overwrite operation triggers
> > >> one or more discard operation submissions to disk.
> > >>
> > >> And I doubt this feature has been widely used, if at all.
> > >>
> > >> Nevertheless recently we've got a PR to rework some aspects of thread
> > >> management for this stuff, see https://github.com/ceph/ceph/pull/55469
> > >>
> > >> The author claimed they needed this feature for their cluster so you
> > >> might want to ask them about their user experience.
> > >>
> > >>
> > >> W.r.t documentation - actually there are just two options
> > >>
> > >> - bdev_enable_discard - enables issuing discard to disk
> > >>
> > >> - bdev_async_discard - instructs whether discard requests are issued
> > >> synchronously (along with disk extents release) or asynchronously
> > >> (using
> > >> a background thread).
> > >>
> > >> Thanks,
> > >>
> > >> Igor
> > >>
> > >> On 01/03/2024 13:06, jst...@proxforge.de wrote:
> > >> > Is there any update on this? Did someone test the option and have
> > >> > performance values before and after?
> > >> > Is there any good documentation regarding this option?