What you describe sounds like expected behavior.  It’s a feature!

Since Nautilus, I think: you (or the autoscaler) set pg_num, and the cluster
gradually steps pgp_num up until it matches.
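You can watch the stepping happen with something like the below. This is a sketch; "default.rgw.buckets.data" is a placeholder pool name, substitute your own.

```shell
# Compare the target pg_num with the gradually-increasing pgp_num.
# "default.rgw.buckets.data" is a placeholder pool name.
ceph osd pool get default.rgw.buckets.data pg_num
ceph osd pool get default.rgw.buckets.data pgp_num

# The pace is throttled by how much misplaced data the cluster will
# tolerate at once (mgr option, default 0.05 = 5%):
ceph config get mgr target_max_misplaced_ratio
```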

Increasing pg_num means splitting PGs, which in turn perturbs the inputs to the 
CRUSH hash function, so data moves: backfill.

Moving data on HDDs isn’t fast, especially with EC.  Backfill is largely 
random, fragmented writes, so figure on the order of 70 MB/s to a given drive.

> As expected, the backfilling started ...and it never ended ...even now
> after more than 1 week I still have about 29 pgs backfilling and 13
> backfilling_wait

Back pre-Nautilus this would have been a thundering herd of backfill.  You 
don’t know how good we have it now ;)

> What worries me is that the number of backfilling PGs varies very little
> over time  e.g 28 and 12  ALTHOUGH there is constant "recovery" traffic
> between 250 and 350MiB

The number of PGs backfillING at any given time is a function of multiple 
things, including the value of osd_max_backfills.
EC means each backfill ties up 6 drives (one per shard), so there’s a bit more 
gridlock compared to replicated pools.
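To see where the limit sits and how many PGs are actually moving vs. waiting on a reservation, something like this works (a sketch; the awk is just counting state strings):

```shell
# Current per-OSD backfill concurrency limit:
ceph config get osd osd_max_backfills

# Count PGs actively backfilling vs. waiting for a backfill reservation
# ("backfilling" and "backfill_wait" are distinct state substrings):
ceph pg dump pgs_brief 2>/dev/null | \
  awk '/backfilling/ {a++} /backfill_wait/ {w++} END {print a" backfilling, "w" waiting"}'
```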

> 
> The "recovery" seems to be doing something ( but number of objects remain
> the same )

The number of objects, or the number of *misplaced/remapped* objects?

Is it showing *keys* per second?  RGW stores a lot of omap data in RocksDB.
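A quick way to check which counter you’re actually watching:

```shell
# The status output distinguishes total objects from misplaced/degraded,
# and the recovery line may be in objects/s and/or keys/s --
# omap-heavy RGW pools tend to show keys/s:
ceph -s

# Per-pool recovery rates, to see which pool the traffic belongs to:
ceph osd pool stats
```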

> Since the recovery should run over the cluster network and the amount of
> data in the pool is not huge, I am not sure why it takes so many days - it
> seems stuck actually

Have you reverted to the wpq scheduler?

osd_op_queue = wpq
osd_mclock_override_recovery_settings = true

You can also increase the value of osd_max_backfills.
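Concretely, something along these lines (note that changing osd_op_queue only takes effect after the OSDs restart):

```shell
# Revert to the wpq scheduler (requires an OSD restart to take effect):
ceph config set osd osd_op_queue wpq

# If you stay on mclock instead, this flag is needed before manual
# recovery/backfill tunables are honored:
ceph config set osd osd_mclock_override_recovery_settings true

# Then raise backfill concurrency, e.g.:
ceph config set osd osd_max_backfills 3
```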


> The only strange thing I noticed is a discrepancy between the number of PG
> and PGP
> that the pool currently has ...and what autoscale-status says

It’s in the process of doing what you asked. 

> 
> Any help / suggestions would be very appreciated
> 
> What I have tried so for :
>     increase recovery speed ( by changing mclock profile to
> "high_recovery_ops"  and overriding various parameters)
>     (recovery_max_active, recovery_max_active_hdd ... etc)

If the default mclock scheduler is enabled, that has issues for some 
deployments. There are code improvements in the works, but for now I suggest 
reverting to wpq.

> 
>     redeploying some of the OSDs that were "UP_PRIMARY but part of the
> backfill_wait PGs

Redeploying OSDs isn’t often called for, and can chum the waters.  It also adds 
a lot of backfill/recovery to what you already have going on.

If you want a gentle goose when things seem stuck, you can try

        ceph osd down XXX

for the lead OSD of a given PG, one at a time, or

        ceph pg repeer xx.yyyy
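To find which OSD is the lead for a stuck PG (11.2f below is a placeholder PG id):

```shell
# Shows the up/acting sets and the primary OSD for the PG:
ceph pg map 11.2f

# "ceph osd down" only marks the OSD down in the map; the daemon keeps
# running and re-asserts itself, which forces its PGs to re-peer.
ceph osd down XXX
```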

> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
