I realized blockjob is freed after completed unless we call block_job_ref()
before run_block_job is called. On 2017/6/15 10:38, sochin.jiang wrote: > Thanks for your kindly reply. > > I do have made a mistake that ignoring the AIOContext lock. > > About the patch, firstly, if job->ret comes to be non-zero(also means > job->completed to be true) , blockjob 'callback'(common_block_job_cb) will be > called, blockjob error will be put into errp. It won't report success. > > Secondly, blockjob fails with 'ret < 0' and without calling > block_job_complete_sync(), we won't have segfault because bdrv_reopen won't > be called. Also, with the use-after-free problems. > > So, skip the block_job_complete_sync() call if job->completed(job->ret to be > non-zero) is true can avoid all the problems, am I right ? > > Thank you again. > > > Best Regard. > > Sochin > > > > > > > > On 2017/6/14 21:12, Max Reitz wrote: >> Thanks for your patch! The issue can be reproduced as follows: >> >> $ qemu-img create -f qcow2 -b \ >> "json:{'driver':'raw','file':{ >> 'driver':'blkdebug','inject-error':[{'event':'write_aio'}], >> 'image':{'driver':'null-co'}}}" \ >> overlay.qcow2 >> $ qemu-io -c 'write 0 64k' overlay.qcow2 >> $ qemu-img commit overlay.qcow2 >> >> While your patch fixes that issue, I still have some comments: >> >> On 2017-06-14 08:22, sochin.jiang wrote: >>> From: "sochin.jiang" <sochin.ji...@huawei.com> >>> >>> img_commit could fall into infinite loop if it's blockjob >> This should be "into an infinite loop" and "its" instead if "it's". >> >> This empty line should be omitted. >> >>> fail encountering any I/O error. Try to fix it. >> Should be "fails on any I/O error" or "fails on encountering any I/O >> error". Also, you're not trying to fix it but let's all hope you really >> are fixing it. :-) >> >> (So "Fix it." instead of "Try to fix it.") >> >>> Signed-off-by: sochin.jiang <sochin.ji...@huawei.com> >>> --- >>> qemu-img.c | 5 ++++- >>> 1 file changed, 4 insertions(+), 1 deletion(-) >>> >>> diff --git a/qemu-img.c b/qemu-img.c >>> index 0ad698d..6ba565d 100644 >>> --- a/qemu-img.c >>> +++ b/qemu-img.c >>> @@ -895,8 +895,11 @@ static void run_block_job(BlockJob *job, Error **errp) >>> aio_poll(aio_context, true); >>> qemu_progress_print(job->len ? >>> ((float)job->offset / job->len * 100.f) : >>> 0.0f, 0); >>> - } while (!job->ready); >>> + } while (!job->ready && !job->ret); >> I think it would be better to test job->completed instead of job->ret. >> >>> >>> + if (job->ret) { >>> + return; >>> + } >> We shouldn't just return here but still do all the deinitialization like >> call aio_context_release(). I guess the best would be to just skip the >> block_job_complete_sync() call if job->completed is true. >> >>> block_job_complete_sync(job, errp); >>> aio_context_release(aio_context); >> Then, there are three more issues I found while reviewing this patch: >> >> First, if the block job is completed before block_job_complete_sync() is >> called (i.e. if an error occurred), it is automatically freed. This is >> bad because this means we'll have some instances of use-after-free here. >> Therefore, we need to invoke block_job_ref() before run_block_job() and >> block_job_unref() afterwards. (And since these functions are currenctly >> static in blockjob.c, we'll have to make them global.) >> >> Secondly, run_block_job() doesn't evaluate job->ret. Therefore it will >> report success even if the commit failed (it is expecting >> block_job_complete_sync() to put an error into errp, but it will not do >> that). So we'll have to do that (manually check job->ret and if it's >> negative, put an error message into errp; also, assert that >> job->cancelled is false). >> >> Thirdly, we have segfault in bdrv_reopen_prepare() if the image has >> non-string options... I'll handle this one. >> >> I can also handle the other two issues, if you'd like me to. >> >> >> Finally, an iotest would be nice (see my reproducer above). But I can >> handle that as well, if you decide not to write one. >> >> Max >>