Hi Sage,
I uploaded a lot of debug logs from the OSDs and Mons:
ceph-post-file: 4ebc2eeb-7bb1-48c4-bbfa-ed581faca74f
At 13:24:25 I stopped OSD 122, and one minute later I started it again.
In both cases I got slow ops.
I am currently running the upstream version (without crude patches):
ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
I hope you can work with it.
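For anyone who wants to reproduce this: below is a rough sketch of how the requested debug levels (debug_osd=10 and debug_ms=1 on the OSDs, debug_mon=10 and debug_ms=1 on the mons) can be set through the central config store with `ceph config set` and removed again with `ceph config rm`. The helper names `run`, `set_debug_levels` and `reset_debug_levels` and the `DRY_RUN` switch are only illustrative. By default the script just prints the commands, and it is not necessarily the exact procedure used for the uploaded logs.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative, not part of Ceph.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Debug levels requested in the quoted mail.
set_debug_levels() {
    run ceph config set osd debug_osd 10
    run ceph config set osd debug_ms 1
    run ceph config set mon debug_mon 10
    run ceph config set mon debug_ms 1
}

# Remove the overrides again so the daemons fall back to their defaults.
reset_debug_levels() {
    run ceph config rm osd debug_osd
    run ceph config rm osd debug_ms
    run ceph config rm mon debug_mon
    run ceph config rm mon debug_ms
}

set_debug_levels
```

Running it with DRY_RUN=0 would execute the commands against the cluster. Calling reset_debug_levels afterwards is advisable, since these log levels produce a lot of output.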
Here is the current config:
# ceph config dump
WHO     MASK  LEVEL     OPTION                                          VALUE     RO
global        advanced  osd_fast_shutdown                               false
global        advanced  osd_fast_shutdown_notify_mon                    false
global        dev       osd_pool_default_read_lease_ratio               0.800000
global        advanced  paxos_propose_interval                          1.000000
mon           advanced  auth_allow_insecure_global_id_reclaim           true
mon           advanced  mon_warn_on_insecure_global_id_reclaim          false
mon           advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
mgr           advanced  mgr/balancer/active                             true
mgr           advanced  mgr/balancer/mode                               upmap
mgr           advanced  mgr/balancer/upmap_max_deviation                1
mgr           advanced  mgr/progress/enabled                            false     *
osd           dev       bluestore_fsck_quick_fix_on_mount               true
# cat /etc/ceph/ceph.conf
[global]
# The following parameters are defined in the service.properties like below
# ceph.conf.globa.osd_max_backfills: 1
bluefs bufferd io = true
bluestore fsck quick fix on mount = false
cluster network = 10.88.26.0/24
fsid = 72ccd9c4-5697-478c-99f6-b5966af278c6
max open files = 131072
mon host = 10.88.7.41 10.88.7.42 10.88.7.43
mon max pg per osd = 600
mon osd down out interval = 1800
mon osd down out subtree limit = host
mon osd initial require min compat client = luminous
mon osd min down reporters = 2
mon osd reporter subtree level = host
mon pg warn max object skew = 100
osd backfill scan max = 16
osd backfill scan min = 8
osd deep scrub stride = 1048576
osd disk threads = 1
osd heartbeat min size = 0
osd max backfills = 1
osd max scrubs = 1
osd op complaint time = 5
osd pool default flag hashpspool = true
osd pool default min size = 1
osd pool default size = 3
osd recovery max active = 1
osd recovery max single start = 1
osd recovery op priority = 3
osd recovery sleep hdd = 0.0
osd scrub auto repair = true
osd scrub begin hour = 5
osd scrub chunk max = 1
osd scrub chunk min = 1
osd scrub during recovery = true
osd scrub end hour = 23
osd scrub load threshold = 1
osd scrub priority = 1
osd scrub thread suicide timeout = 0
osd snap trim priority = 1
osd snap trim sleep = 1.0
public network = 10.88.7.0/24
[mon]
mon allow pool delete = false
mon health preluminous compat warning = false
osd pool default flag hashpspool = true
On Thu, 11 Nov 2021 09:16:20 -0600
Sage Weil <[email protected]> wrote:
> Hi Manuel,
>
> Before giving up and putting in an off switch, I'd like to understand
> why it is taking as long as it is for the PGs to go active.
>
> Would you consider enabling debug_osd=10 and debug_ms=1 on your OSDs,
> and debug_mon=10 + debug_ms=1 on the mons, and reproducing this
> (without the patch applied this time of course!)? The logging will
> slow things down a bit but hopefully the behavior will be close
> enough to what you see normally that we can tell what is going on
> (and presumably picking out the pg that was most laggy will highlight
> the source(s) of the delay).
>
> sage
>
> On Wed, Nov 10, 2021 at 4:41 AM Manuel Lausch <[email protected]>
> wrote:
>