On 3/7/19 8:03 AM, Sergio Lopez wrote:
> While child_job_drained_begin() calls to job_pause(), the job doesn't
> actually transition between states until it runs again and reaches a
> pause point. This means bdrv_drained_begin() may return with some jobs
> using the node still having 'busy == true'.
>
> As a consequence, block_job_detach_aio_context() may get into a
> deadlock, waiting for the job to be actually paused, while the coroutine
> servicing the job is yielding and doesn't get the opportunity to get
> scheduled again. This situation can be reproduced by issuing a
> 'block-commit' immediately followed by a 'device_del'.
>
> To ensure bdrv_drained_begin() only returns when the jobs have been
> paused, we change mirror_drained_poll() to only confirm it's quiesced
> when job->paused == true and there aren't any in-flight requests, except
> if we reached that point by a drained section initiated by the
> mirror/commit job itself.
>
> The other block jobs shouldn't need any changes, as the default
> drained_poll() behavior is to only confirm it's quiesced if the job is
> not busy or completed.
>
> Signed-off-by: Sergio Lopez <s...@redhat.com>
> ---
>  block/mirror.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
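
To make the rule in the commit message concrete, here is a minimal
standalone sketch of the predicate it describes. This is not QEMU code:
struct model_job, model_drained_poll() and model_self_check() are
hypothetical names that only mirror the fields the hunk reads, and I am
assuming that returning true means "keep polling, not quiesced yet".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields the hunk reads; not QEMU's types. */
struct model_job {
    bool paused;     /* models s->common.job.paused */
    bool cancelled;  /* models s->common.job.cancelled */
    bool in_drain;   /* models s->in_drain */
    int in_flight;   /* models s->in_flight */
};

/* Returns true while the caller must keep polling (job not yet quiesced). */
bool model_drained_poll(const struct model_job *s)
{
    /* Not paused, not cancelled, and not in our own drain section:
     * the job may still issue requests, so report it as busy. */
    if (!s->paused && !s->cancelled && !s->in_drain) {
        return true;
    }
    /* Otherwise only outstanding requests keep the drain waiting. */
    return s->in_flight != 0;
}

/* Walks the cases named in the commit message. */
void model_self_check(void)
{
    struct model_job j = { false, false, false, 0 };

    assert(model_drained_poll(&j));   /* running, idle: still busy */

    j.paused = true;
    assert(!model_drained_poll(&j));  /* paused, idle: quiesced */

    j.in_flight = 1;
    assert(model_drained_poll(&j));   /* paused, request in flight: busy */

    j.paused = false;
    j.in_flight = 0;
    j.in_drain = true;
    assert(!model_drained_poll(&j));  /* own drain section, idle: quiesced */

    j.in_drain = false;
    j.cancelled = true;
    assert(!model_drained_poll(&j));  /* cancelled, idle: quiesced */
}
```

The in_drain case is the one that avoids the job waiting on itself: with
that exception removed, the model would report "busy" for a job polling
from inside its own drained section.
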
> @@ -1119,6 +1126,16 @@ static void coroutine_fn mirror_pause(Job *job)
>  static bool mirror_drained_poll(BlockJob *job)
>  {
>      MirrorBlockJob *s = container_of(job, MirrorBlockJob, common);
> +
> +    /* If the job isn't paused nor cancelled, we can't be sure that it won't
> +     * issue more requets. We make an exception if we've reached this point

requests

> +     * from one of our own drain sections, to avoid a deadlock waiting for
> +     * ourselves.
> +     */
> +    if (!s->common.job.paused && !s->common.job.cancelled && !s->in_drain) {
> +        return true;
> +    }
> +
>      return !!s->in_flight;
>  }
>
> --

Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org