On 9/11/19 12:21 PM, Eric Blake wrote:
> On 9/11/19 11:15 AM, Sergio Lopez wrote:
>> On creation, the export's AioContext is set to the same one as the
>> BlockBackend, while the AioContext in the client QIOChannel is left
>> untouched.
>>
>> As a result, when using data-plane, nbd_client_receive_next_request()
>> schedules coroutines in the IOThread AioContext, while the client's
>> QIOChannel is serviced from the main_loop, potentially triggering the
>> assertion at qio_channel_restart_[read|write].
>>
>> To fix this, as soon we have the export corresponding to the client,
>> we call qio_channel_attach_aio_context() to attach the QIOChannel
>> context to the export's AioContext. This matches with the logic in
>> blk_aio_attached().
>>
>> RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1748253
>> Signed-off-by: Sergio Lopez <s...@redhat.com>
>> ---
>>  nbd/server.c | 2 ++
>>  1 file changed, 2 insertions(+)
> 
> I'd like a second opinion from Kevin, but the description makes sense to
> me.  I'm happy to queue this through my NBD tree.
> 
> Reviewed-by: Eric Blake <ebl...@redhat.com>

I tried to test this patch, but even with it applied, I still got an
aio-context crasher by attempting an nbd-server-start, nbd-server-add,
nbd-server-stop (intentionally skipping the nbd-server-remove step) on a
domain using iothreads, with a backtrace of:

#0  0x00007ff09d070e35 in raise () from target:/lib64/libc.so.6
#1  0x00007ff09d05b895 in abort () from target:/lib64/libc.so.6
#2  0x000055dd03b9ab86 in error_exit (err=1, msg=0x55dd03d59fb0
<__func__.15769> "qemu_mutex_unlock_impl")
    at util/qemu-thread-posix.c:36
#3  0x000055dd03b9adcf in qemu_mutex_unlock_impl (mutex=0x55dd062d5090,
file=0x55dd03d59041 "util/async.c",
    line=523) at util/qemu-thread-posix.c:96
#4  0x000055dd03b93433 in aio_context_release (ctx=0x55dd062d5030) at
util/async.c:523
#5  0x000055dd03ac421b in bdrv_do_drained_begin (bs=0x55dd0673a2d0,
recursive=false, parent=0x0,
    ignore_bds_parents=false, poll=true) at block/io.c:428
#6  0x000055dd03ac4299 in bdrv_drained_begin (bs=0x55dd0673a2d0) at
block/io.c:434
#7  0x000055dd03aafb54 in blk_drain (blk=0x55dd06a3ec40) at
block/block-backend.c:1605
#8  0x000055dd03aae054 in blk_remove_bs (blk=0x55dd06a3ec40) at
block/block-backend.c:800
#9  0x000055dd03aad54a in blk_delete (blk=0x55dd06a3ec40) at
block/block-backend.c:420
#10 0x000055dd03aad7d6 in blk_unref (blk=0x55dd06a3ec40) at
block/block-backend.c:475
#11 0x000055dd03aecb68 in nbd_export_put (exp=0x55dd0726f920) at
nbd/server.c:1666
#12 0x000055dd03aec8fe in nbd_export_close (exp=0x55dd0726f920) at
nbd/server.c:1616
#13 0x000055dd03aecbf1 in nbd_export_close_all () at nbd/server.c:1689
#14 0x000055dd03748845 in qmp_nbd_server_stop (errp=0x7ffcdf3cb4e8) at
blockdev-nbd.c:233
...

Does that sound familiar to what you were seeing?  Does it mean we
missed another spot where the context is set incorrectly?

I'm happy to work with you on IRC for more real-time debugging of this
(I'm woefully behind on understanding how aio contexts are supposed to
work).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to