ceph pg dump | grep backfill

Look through the output of that command and check each PG's acting set (the OSDs the PG is on/moving off of) and up set (where the PG will end up). All it takes is a single OSD that is part of a PG currently backfilling: any other PG that OSD is listed on will sit in backfill_wait until one of that OSD's osd_max_backfills slots frees up.
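That check can be automated. A sketch (not from the thread; the record layout and sample data are made up, assuming you have already parsed `ceph pg dump` into pgid/state/up/acting records):

```python
# Hypothetical helper: find which OSDs a waiting PG shares with an
# actively backfilling one. Field names and sample data are illustrative.

def blocking_osds(pgs):
    """Map each backfill_wait PG to the OSDs it shares with backfilling PGs."""
    busy = set()
    for pg in pgs:
        if pg["state"].endswith("backfilling"):
            # Every OSD in either set of a backfilling PG holds a reservation.
            busy |= set(pg["up"]) | set(pg["acting"])
    blocked = {}
    for pg in pgs:
        if "backfill_wait" in pg["state"]:
            overlap = (set(pg["up"]) | set(pg["acting"])) & busy
            if overlap:
                blocked[pg["pgid"]] = sorted(overlap)
    return blocked

pgs = [
    {"pgid": "1.a", "state": "active+remapped+backfilling",
     "up": [1, 2, 3], "acting": [1, 2, 9]},
    {"pgid": "1.b", "state": "active+remapped+backfill_wait",
     "up": [3, 4, 5], "acting": [4, 5, 6]},
]
print(blocking_osds(pgs))  # -> {'1.b': [3]}: 1.b waits on osd.3, shared with 1.a
```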
On Thu, Jul 6, 2017, 1:57 PM <[email protected]> wrote:

> Thanks for your response, David.
>
> What you've described has been what I've been thinking about too. We have
> 1401 OSDs in the cluster currently and this output is from the tail end of
> the backfill for a +64 PG increase on the biggest pool.
>
> The problem is we see this cluster do at most 20 backfills at the same
> time, and as the queue of PGs to backfill gets smaller there are fewer and
> fewer actively backfilling, which I don't quite understand.
>
> Out of the PGs currently backfilling, all of them have completely changed
> their sets (the difference between acting and up sets is 11), which makes
> some sense since what moves around are the newly spawned PGs. That's 5 PGs
> currently in backfilling states, which makes 110 OSDs blocked. What
> happened to the other 1300? That's what's strange to me. There are
> another 7 waiting to backfill.
>
> Out of all the OSDs in the up and acting sets of all PGs currently
> backfilling or waiting to backfill there are 13 OSDs in common, so I guess
> that kind of answers it. I haven't checked, but I suspect each backfilling
> PG has at least one OSD in one of its sets in common with either set of
> one of the waiting PGs.
>
> So I guess we can't do much about the tail end taking so long: there's no
> way for more of the PGs to actually be backfilling at the same time.
>
> I think we'll have to try bumping osd_max_backfills. Has anyone tried
> bumping the relative priorities of recovery vs. others? What about
> noscrub?
>
> Best regards,
>
> George
>
> ________________________________
> From: David Turner [[email protected]]
> Sent: 06 July 2017 16:08
> To: Vasilakakos, George (STFC,RAL,SC); [email protected]
> Subject: Re: [ceph-users] Speeding up backfill after increasing PGs and/or
> adding OSDs
>
> Just a quick place to start is osd_max_backfills. You have this set to 1.
> Each PG is on 11 OSDs.
> When you have a PG moving, it is on the original 11 OSDs and the new X
> number of OSDs that it is going to. For each of your PGs that is moving,
> an OSD can only take part in 1 backfill at a time (your
> osd_max_backfills), and each PG is on 11 + X OSDs.
>
> So with your cluster: I don't see how many OSDs you have, but you have 25
> PGs moving around and 8 of them are actively backfilling. Assuming you
> were only changing 1 OSD per backfill operation, that would mean that you
> had at least 96 OSDs ((11 + 1) * 8). That would be a perfect distribution
> of OSDs for the PGs backfilling. Let's say now that you're averaging
> closer to 3 OSDs changing per PG and that the remaining 17 PGs waiting to
> backfill are blocked by a few OSDs each (because those OSDs are already
> included in the 8 actively backfilling PGs). That would indicate that you
> have closer to 200+ OSDs.
>
> Every time I'm backfilling and want to speed things up, I watch iostat on
> some of my OSDs and increase osd_max_backfills until I'm consistently
> using about 70% of the disk, to leave headroom for customer I/O. You can
> always figure out what's best for your use case though. Generally I've
> been OK running with osd_max_backfills=5 without much problem, bringing
> that up some when I know that client IO will be minimal, but again it
> depends on your use case and cluster.
>
> On Thu, Jul 6, 2017 at 10:08 AM <[email protected]> wrote:
>
> Hey folks,
>
> We have a cluster that's currently backfilling from increasing PG counts.
> We have tuned recovery and backfill way down as a "precaution" and would
> like to start tuning it to strike a good balance between that and
> client I/O.
>
> At the moment we're in the process of bumping up PG numbers for pools
> serving production workloads. Said pools are EC 8+3.
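The arithmetic in the replies above can be sketched as follows (illustrative only; assumes osd_max_backfills=1 and non-overlapping sets):

```python
def min_osds_for_backfills(set_size, osds_changed, concurrent):
    """With osd_max_backfills=1 each OSD can serve one backfill at a time,
    so `concurrent` non-overlapping backfills need this many distinct OSDs."""
    return (set_size + osds_changed) * concurrent

# David's lower bound: 8 active backfills, 11-wide EC sets, 1 OSD changing each
print(min_osds_for_backfills(11, 1, 8))   # -> 96

# George's tail-end case: 5 backfilling PGs whose up and acting sets are
# fully disjoint each occupy 2 * 11 = 22 OSDs
print(5 * 2 * 11)                          # -> 110 OSDs holding reservations
```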
>
> It looks like we're having very low numbers of PGs backfilling, as in:
>
>     2567 TB used, 5062 TB / 7630 TB avail
>     145588/849529410 objects degraded (0.017%)
>     5177689/849529410 objects misplaced (0.609%)
>         7309 active+clean
>           23 active+clean+scrubbing
>           18 active+clean+scrubbing+deep
>           13 active+remapped+backfill_wait
>            5 active+undersized+degraded+remapped+backfilling
>            4 active+undersized+degraded+remapped+backfill_wait
>            3 active+remapped+backfilling
>            1 active+clean+inconsistent
>     recovery io 1966 MB/s, 96 objects/s
>     client io 726 MB/s rd, 147 MB/s wr, 89 op/s rd, 71 op/s wr
>
> Also, the rate of recovery in terms of data and object throughput varies
> a lot, even with the number of PGs backfilling remaining constant.
>
> Here's the config in the OSDs:
>
>     "osd_max_backfills": "1",
>     "osd_min_recovery_priority": "0",
>     "osd_backfill_full_ratio": "0.85",
>     "osd_backfill_retry_interval": "10",
>     "osd_allow_recovery_below_min_size": "true",
>     "osd_recovery_threads": "1",
>     "osd_backfill_scan_min": "16",
>     "osd_backfill_scan_max": "64",
>     "osd_recovery_thread_timeout": "30",
>     "osd_recovery_thread_suicide_timeout": "300",
>     "osd_recovery_sleep": "0",
>     "osd_recovery_delay_start": "0",
>     "osd_recovery_max_active": "5",
>     "osd_recovery_max_single_start": "1",
>     "osd_recovery_max_chunk": "8388608",
>     "osd_recovery_max_omap_entries_per_chunk": "64000",
>     "osd_recovery_forget_lost_objects": "false",
>     "osd_scrub_during_recovery": "false",
>     "osd_kill_backfill_at": "0",
>     "osd_debug_skip_full_check_in_backfill_reservation": "false",
>     "osd_debug_reject_backfill_probability": "0",
>     "osd_recovery_op_priority": "5",
>     "osd_recovery_priority": "5",
>     "osd_recovery_cost": "20971520",
>     "osd_recovery_op_warn_multiple": "16",
>
> What I'm looking for, first of all, is a better understanding of the
> mechanism that schedules the backfilling/recovery work; the end goal is to
> understand how to tune this safely to achieve as close as possible to an
> optimal balance between the rate at
> which recovery and client work is performed.
>
> I'm thinking things like osd_max_backfills and
> osd_backfill_scan_min/osd_backfill_scan_max might be prime candidates for
> tuning.
>
> Any thoughts/insights from the Ceph community will be greatly appreciated,
>
> George
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
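The runtime bump being discussed could look like this in ceph.conf (a hedged sketch only; the values are illustrative examples, not recommendations, and should be raised gradually while watching iostat and client latency as described above):

```ini
[osd]
# Illustrative values only -- not recommendations for any specific cluster.
osd_max_backfills = 3          # the config dump above shows 1
osd_backfill_scan_min = 64     # dump above shows 16
osd_backfill_scan_max = 256    # dump above shows 64
```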
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
