On Tue, Jan 19, 2021 at 7:18 PM Bob Peterson <[email protected]> wrote:
> ----- Original Message -----
> > On Tue, Jan 19, 2021 at 4:44 PM Bob Peterson <[email protected]> wrote:
> > > Sure, the recovery workers' bio allocations and submitting may be
> > > serialized, but that's where it ends. The recovery workers don't
> > > prevent races with each other when using the variable common to all
> > > of them: sdp->sd_log_bio. This is the case when there are, for
> > > example, 5 journals with 5 different recovery workers, all trying
> > > to use the same sdp->sd_log_bio at the same time.
> >
> > Well, sdp->sd_log_bio obviously needs to be moved to a per-journal
> > context.
>
> I tried that and it didn't end well. If we keep multiple bio pointers,
> each recovery worker still needs to make sure all the other bios are
> submitted before allocating a new one. Sure, it could make sure _its_
> previous bio was submitted, and the others would be serialized, but
> there are cases in which they can run out of bios.

Yes, I saw this.
This doesn't make sense. If you've seen starvation, it must have been
for another reason.

> This can happen, for example, when you have 60 gfs2 mounts times 5
> nodes, with lots of workers requesting lots of bios at the same time.
> Unless, of course, we allocate unique bio_sets that get their own
> slabs, etc. We can introduce spinlock locking or something to manage
> this, but when I tried it, I found multiple scenarios that deadlock.
> It gets ugly really fast.

As long as each worker makes sure it doesn't allocate another bio
before submitting its previous bio, it doesn't matter how many workers
there are or what state they're in. They will still make progress as
long as they can allocate at least one bio overall.

> In practice, when multiple nodes in a cluster go down, their journals
> are recovered by several of the remaining cluster nodes, which means
> they happen simultaneously anyway, and pretty quickly. In my case,
> I've got 5 nodes and 2 of them get shot, so the remaining 3 nodes do
> the journal recovery, and I've never seen them conflict with one
> another. Their glocks seem to distribute the work well.
>
> The only time you're really going to see multiple journals recovered
> by the same node (for the same file systems anyway) is when the
> cluster loses quorum. Then when quorum is regained, there is often a
> burst of requests to recover multiple journals on the same few nodes.
> Then the same node often tries to recover several journals for
> several file systems.
>
> So the circumstances are unusual to begin with. But also very
> recreatable.
>
> What's wrong with a single worker that handles them all? What's your
> actual concern with doing it this way? Is it performance? Who cares
> if journal recovery takes 1.4 seconds rather than 1.2 seconds?

It was Steve who questioned if serializing recovery in that way was a
reasonable change. I don't know if recovering multiple journals on the
same node in parallel is very useful.
But I also don't buy your bio starvation argument.

Thanks,
Andreas
