On 21.09.2018 19:01, Eric Blake wrote:
> On 9/20/18 11:19 AM, Kevin Wolf wrote:
> > For the block job drain test, don't only test draining the source and
> > the target node, but create a backing chain for the source
> > (source_backing <- source <- source_overlay) and test draining each of
> > the nodes in it.
> >
> > When using iothreads, the source node (and therefore the job) is in a
> > different AioContext than the drain, which happens from the main
> > thread. This way, the main thread waits in AIO_WAIT_WHILE() for the
> > iothread to make progress and aio_wait_kick() is required to notify it.
> > The test validates that calling bdrv_wakeup() for a child or a parent
> > node will actually notify AIO_WAIT_WHILE() instead of letting it hang.
> >
> > Signed-off-by: Kevin Wolf <kw...@redhat.com>
> > ---
> >  tests/test-bdrv-drain.c | 75 +++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 67 insertions(+), 8 deletions(-)
> >
> > @@ -818,12 +819,17 @@ static int coroutine_fn test_job_run(Job *job, Error **errp)
> >  {
> >      TestBlockJob *s = container_of(job, TestBlockJob, common.job);
> >
> > +    /* We are running the actual job code past the pause point in
> > +     * job_co_entry(). */
> > +    s->running = true;
> > +
> >      job_transition_to_ready(&s->common.job);
> >      while (!s->should_complete) {
> >          /* Avoid job_sleep_ns() because it marks the job as !busy. We want to
> >           * emulate some actual activity (probably some I/O) here so that drain
> >           * has to wait for this acitivity to stop. */
> > -        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
> > +        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
>
> The commit message didn't mention why you had to lengthen this sleep,
> but I'm guessing it's because you are now doing enough additional
> things that you have to scale the delay to match?
Maybe it's actually interesting enough to add a paragraph to the commit
message: When running under 'rr record -n' (because I wanted to debug
some behaviour), the bug suddenly didn't reproduce any more. This was
because bdrv_drain_invoke_entry() (in the main thread) was only called
after the job had already reached the pause point, so we got a
bdrv_dec_in_flight() from the main thread and the additional
aio_wait_kick() when the job becomes idle wasn't necessary any more.

I don't think this would happen outside a debugging environment (no
idea what rr does to multithreading), but I thought increasing the
delay can't hurt because it's still quite short (1 ms).

Of course, if you have an idea how to check the actual condition (only
allow a pause point after the wakeup from bdrv_drain_invoke_entry() has
already happened), I'd consider that instead. But I don't think there's
an easy way to get this information.
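To make the idea concrete, the check I have in mind would look roughly
like the sketch below. Note that drain_wakeup_seen is a made-up flag;
the part I don't know how to implement is what would actually set it
once the wakeup from bdrv_drain_invoke_entry() has been delivered:

static int coroutine_fn test_job_run(Job *job, Error **errp)
{
    TestBlockJob *s = container_of(job, TestBlockJob, common.job);

    s->running = true;
    job_transition_to_ready(&s->common.job);

    while (!s->should_complete) {
        /* Emulate some activity, as in the patch */
        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);

        /* Hypothetical: only yield at the pause point once the drain
         * wakeup from the main thread has been observed, so the job
         * can't pause before bdrv_drain_invoke_entry() has run */
        if (s->drain_wakeup_seen) {
            job_pause_point(&s->common.job);
        }
    }

    return 0;
}

As long as nothing can reliably set such a flag, the longer delay seems
like the more practical workaround to me.

Kevin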