As context for everyone else as to why I'm going down the rabbit hole of trying to remove external references to AioContext at all, see the quoted exchange below.

On 10/07/2016 03:49 AM, Paolo Bonzini wrote:

On 06/10/2016 22:22, John Snow wrote:
Calls .complete(), for which the only implementation is
mirror_complete. Uh, this actually seems messy. Looks like there's
nothing to prevent us from calling this after we've already told it to
complete once.

Yeah, it should have an

    if (s->should_complete) {

at the beginning.  I have other mirror.c patches in my queue so I can
take care of that.

Or should something up the stack, in block_job_complete, handle it so it's not left up to individual job implementations? What if the next implementer "forgets"?

block_job_cancel and block_job_complete are different.

block_job_cancel is called in many places, but we can just add a similar
block_job_user_cancel if we wanted a version which takes care to acquire
context and one that does not. (Or we could just acquire the context
regardless, but Paolo warned me ominously that recursive locks are EVIL.
He sounded serious.)

Not that many places:

- block_job_finish_sync calls it, and you can just release/reacquire
around the call to "finish(job, &local_err)".

This makes me a little nervous: we went through the trouble of creating this callback, but now we're going to assume it's a public interface that takes the lock for itself (or otherwise does not require one).

In practice it works, but it seems needlessly mystifying in terms of proving correctness.

- there are two callers in blockdev.c, and you can just remove the
acquire/release from blockdev.c if you push it in block_job_cancel.

Makes sense; I don't like the association of (bs :: job) here anyway. Again we're grabbing context for a job where that job may not even be running.

As to block_job_cancel_sync:

I didn't audit that one, because no callers use job->blk to get the AioContext before calling it; they use bs when bs->job is present.

- replication_stop is not acquiring s->secondary_disk->bs's AioContext.

Seems like a bug on their part. Would be fixed by having cancel acquire context for itself.

- there is no need to hold the AioContext between ->prepare and ->clean.
My suggestion is to ref the blockjob after storing it in state->job
(you probably should do that anyway) and unref it in ->clean.  Then
you can again let block_job_cancel_sync(bs->job) take the
AioContext, which it will do in block_job_finish_sync.

Yeah, I should be reffing it anyway.

The rest of this... What I think you mean is acquiring and releasing the context as needed for EACH of prepare, commit, abort, and clean as necessary, right?

And then in this case, it simply wouldn't be necessary for abort, as the sync cancel would do it for us.

block_job_complete has no direct callers outside of QMP, but it is also
used as a callback by block_job_complete_sync, used in qemu-img for
run_block_job. I can probably rewrite qemu-img to avoid this usage.

No need to: qemu-img is not acquiring the AioContext, so it's okay to
let block_job_complete do that (like block_job_cancel,
block_job_complete will be called by block_job_finish_sync without the
AioContext acquired).

Eh? Oh, you're right, it just gets it for the sake of aio_poll.



Say I *do* push the acquisitions down into blockjob.c. What benefit does that provide? Won't I still need the block_job_get_aio_context() function (at least internally) to know which context to acquire? That would preclude you from deleting it.

Plus... we remove some fairly simple locking mechanisms and then inflate it tenfold. I'm not convinced this is an improvement.

As context and a refresher (for me, when I re-read this email in 12 hours): there are two places externally that use an AioContext lock as acquired from *within* a BlockJob, excluding those that acquire a context separately from a job and use that to reason that accesses to the job are safe (for example, blockdev_mark_auto_del).

(1) QMP interface for job management
(2) bdrv_drain_all, in block/io.c

(1) AFAICT, the QMP interface is concerned with assuring itself it has unique access to the BlockJob structure itself; it doesn't really care about the AioContext as such -- just race-free access to the job.

This is not necessarily buggy today because, even though we grab the BlockBackend's context unconditionally, we already know the main/monitor thread is not accessing the blockjob. It's still silly, though.

(2) bdrv_drain_all appears to be worried about the same thing; we just need to safely deliver pause/resume messages.

I'm less sure about where this can run from, and I suspect that if the job has deferred to the main loop, this could be buggy. If bdrv_drain_all is called from context A while the job is running on context M, having deferred from B, we may lock context B (from A) and still have racy access between A and M. (Maybe?)

It looks like none of these usages actually care which context they're getting, just that they get safe access to the job itself. If you want to decouple jobs from their BlockBackend's AioContext, perhaps we just need to replace these with a simple mutex...?

As it stands, though, pushing AIOContext acquisitions down into blockjob.c isn't actually going to fix anything, is it? Why not just leave it be for right now?

Would you be terribly offended if I left this patch as-is for now and we can work on removing the AioContext locks afterwards, or are you adamant that I cooperate in getting block_job_get_aio_context removed before I add more usages?

Sorry I'm being so obtuse about this all.

