Hi Sam,
We discussed this briefly on IRC, but I think it might be better to recap in an
email.
Currently we schedule backfill/recovery based on how degraded the PG is,
with a factor distinguishing recovery vs. backfill (recovery always has higher
priority). The degradation level of a PG is calculated as:
{expected_pool_size} - {acting_set_size}. I think there are two issues with the
current approach:
1. The current {acting_set_size} might not capture the degradation level over
the past intervals. For example, say we have two PGs (erasure coded with 8 data
and 3 parity chunks), 1.0 and 1.1:
    1.1 At t1, PG 1.0's acting set size drops to 8 while PG 1.1's acting set is
still 11
    1.2 At t2, PG 1.0's acting set size comes back to 10 while PG 1.1's acting
set drops to 9
1.3 At t3, we start recovering (e.g. mark out some OSDs)
With the current algorithm, PG 1.1 will recover first and then PG 1.0 (if the
concurrency is configured as 1); however, from a data durability perspective,
the data written to PG 1.0 between t1 and t2 is more degraded and at higher
risk (see the sketch after the list below).
2. The algorithm does not take EC vs. replication into account (nor the EC
profile), which might also be important to consider for data durability.
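To make point 1 concrete, here is a minimal sketch (plain C++, not Ceph code;
the 8+3 profile, the numbers, and the min_acting_over_past bookkeeping are just
my illustrative assumptions) comparing the current size-at-scheduling-time
ordering with an ordering that also considers how small the acting set was when
the data was written:

#include <algorithm>
#include <cstdio>
#include <vector>

struct PGState {
  const char *name;
  int acting_set_size;       // acting set size at scheduling time (t3)
  int min_acting_over_past;  // smallest acting set seen since the data was written
};

int main() {
  const int expected_pool_size = 11;  // EC 8+3
  std::vector<PGState> pgs = {
      {"1.0", 10, 8},  // dropped to 8 at t1, back to 10 at t2
      {"1.1", 9, 9},   // dropped to 9 at t2
  };

  // Current approach: priority derived only from the acting set size right now.
  std::sort(pgs.begin(), pgs.end(), [&](const PGState &a, const PGState &b) {
    return expected_pool_size - a.acting_set_size >
           expected_pool_size - b.acting_set_size;
  });
  std::printf("current order: %s, %s\n", pgs[0].name, pgs[1].name);
  // Prints "1.1, 1.0": degradation 2 vs. 1, so 1.1 recovers first, even though
  // objects written to 1.0 between t1 and t2 may exist in only 8 of 11 chunks.

  // Durability-oriented alternative (my assumption, not an existing option):
  // rank by the worst degradation the PG's data may have seen.
  std::sort(pgs.begin(), pgs.end(), [&](const PGState &a, const PGState &b) {
    return expected_pool_size - a.min_acting_over_past >
           expected_pool_size - b.min_acting_over_past;
  });
  std::printf("durability-aware order: %s, %s\n", pgs[0].name, pgs[1].name);
  // Prints "1.0, 1.1".
  return 0;
}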
Is my understanding correct here?
Thanks,
Guang