Am 22.12.2015 um 00:05 hat Max Reitz geschrieben:
> On 21.12.2015 22:33, Denis V. Lunev wrote:
> > On 12/21/2015 04:18 PM, Denis V. Lunev wrote:
> >> On 12/07/2015 08:07 PM, Roman Kagan wrote:
> >>> qcow2_get_specific_info() used to have a code path which would leave
> >>> pointer to ImageInfoSpecificQCow2 uninitialized.
> >>>
> >>> We guess that it caused sporadic crashes on freeing an invalid pointer
> >>> in response to "query-block" QMP command in
> >>> visit_type_ImageInfoSpecificQCow2 with QapiDeallocVisitor.
> >>>
> >>> Although we have neither a solid proof nor a reproduction scenario,
> >>> making sure the field is initialized appears a reasonable thing to do.
> >>>
> >>> Signed-off-by: Roman Kagan <[email protected]>
> >>> ---
> >>>  block/qcow2.c | 2 +-
> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/block/qcow2.c b/block/qcow2.c
> >>> index 88f56c8..67c9d3d 100644
> >>> --- a/block/qcow2.c
> >>> +++ b/block/qcow2.c
> >>> @@ -2739,7 +2739,7 @@ static ImageInfoSpecific *qcow2_get_specific_info(BlockDriverState *bs)
> >>>      *spec_info = (ImageInfoSpecific){
> >>>          .type  = IMAGE_INFO_SPECIFIC_KIND_QCOW2,
> >>> -        .u.qcow2 = g_new(ImageInfoSpecificQCow2, 1),
> >>> +        .u.qcow2 = g_new0(ImageInfoSpecificQCow2, 1),
> >>>      };
> >>>      if (s->qcow_version == 2) {
> >>>          *spec_info->u.qcow2 = (ImageInfoSpecificQCow2){
> >>
> >> It looks like I have caught this (on QEMU 2.3 RHEV) got from here
> >> http://ftp.redhat.com/pub/redhat/linux/enterprise/7Server/en/RHEV/SRPMS/qemu-kvm-rhev-2.3.0-31.el7_2.3.src.rpm
> >>
> >> Here is a dump
> >>
> >> (gdb) p *blk_backends.tqh_first->bs
> >> $12 = {total_sectors = 134217728, read_only = 0, open_flags = 24802,
> >>   encrypted = 0, valid_key = 0, sg = 0, copy_on_read = 0, probed = false,
> >>   drv = 0x7fe0c047ec40 <bdrv_qcow2>, opaque = 0x7fe0c23acdb0,
> >>   blk = 0x7fe0c23a4640, aio_context = 0x7fe0c237f350,
> >>   aio_notifiers = {lh_first = 0x0},
> >>   filename = "/vz/vmprivate/VM_a80bf3e5.pvm/harddisk.hdd",
> >>     '\000' <repeats 4053 times>,
> >>   backing_file = '\000' <repeats 4095 times>,
> >>   backing_format = '\000' <repeats 15 times>,
> >>   full_open_options = 0x7fe0c2513e30,
> >>   exact_filename = "/vz/vmprivate/VM_a80bf3e5.pvm/harddisk.hdd",
> >>     '\000' <repeats 4053 times>,
> >>   backing_hd = 0x0, file = 0x7fe0c23a9a50,
> >>   close_notifiers = {notifiers = {lh_first = 0x0}},
> >>   before_write_notifiers = {notifiers = {lh_first = 0x0}},
> >>   serialising_in_flight = 0,
> >>   throttle_state = {cfg = {buckets = {{avg = 0, max = 0, level = 0},
> >>         {avg = 0, max = 0, level = 0}, {avg = 0, max = 0, level = 0},
> >>         {avg = 0, max = 0, level = 0}, {avg = 0, max = 0, level = 0},
> >>         {avg = 0, max = 0, level = 0}}, op_size = 0}, previous_leak = 0,
> >>     timers = {0x0, 0x0}, clock_type = QEMU_CLOCK_REALTIME,
> >>     read_timer_cb = 0x0, write_timer_cb = 0x0, timer_opaque = 0x0},
> >>   throttled_reqs = {{entries = {tqh_first = 0x0,
> >>         tqh_last = 0x7fe0c23a7810}}, {entries = {tqh_first = 0x0,
> >>         tqh_last = 0x7fe0c23a7820}}},
> >>   io_limits_enabled = false,
> >>   stats = {nr_bytes = {0, 0, 0}, nr_ops = {0, 0, 0},
> >>     total_time_ns = {0, 0, 0}, merged = {0, 0, 0}, wr_highest_sector = 0},
> >>   bl = {max_discard = 0, discard_alignment = 0, max_write_zeroes = 0,
> >>     write_zeroes_alignment = 128, opt_transfer_length = 0,
> >>     max_transfer_length = 0, opt_mem_alignment = 512},
> >>   zero_beyond_eof = true, request_alignment = 512,
> >>   guest_block_size = 512, enable_write_cache = 1,
> >>   on_read_error = BLOCKDEV_ON_ERROR_REPORT,
> >>   on_write_error = BLOCKDEV_ON_ERROR_ENOSPC, iostatus_enabled = true,
> >>   iostatus = BLOCK_DEVICE_IO_STATUS_OK,
> >>   node_name = '\000' <repeats 31 times>,
> >>   node_list = {tqe_next = 0x0, tqe_prev = 0x0},
> >>   device_list = {tqe_next = 0x7fe0c2514e50,
> >>     tqe_prev = 0x7fe0c047cff0 <bdrv_states>},
> >>   dirty_bitmaps = {lh_first = 0x0}, refcnt = 1,
> >>   tracked_requests = {lh_first = 0x0},
> >>   op_blockers = {{lh_first = 0x0} <repeats 15 times>}, job = 0x0,
> >>   options = 0x7fe0c23a2ca0,
> >>   detect_zeroes = BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF,
> >>   backing_blocker = 0x0, write_threshold_offset = 0,
> >>   write_threshold_notifier = {notify = 0x0,
> >>     node = {le_next = 0x0, le_prev = 0x0}}}
> >>
> >> (gdb) p *(BDRVQcowState*)blk_backends.tqh_first->bs->opaque
> >> $13 = {cluster_bits = 0, cluster_size = 0, cluster_sectors = 0,
> >>   l2_bits = 0, l2_size = 0, l1_size = 0, cluster_offset_mask = 0,
> >>   l1_table_offset = 0, l1_table = 0x0, l2_cache = 0x0,
> >>   l2_cache_offsets = {0 <repeats 16 times>},
> >>   l2_cache_counts = {0 <repeats 16 times>}, cluster_cache = 0x0,
> >>   cluster_data = 0x0, cluster_cache_offset = 0, crypt_method = 0,
> >>   crypt_method_header = 0,
> >>   aes_encrypt_key = {rd_key = {0 <repeats 60 times>}, rounds = 0},
> >>   aes_decrypt_key = {rd_key = {0 <repeats 60 times>}, rounds = 0},
> >>   lock = {locked = false,
> >>     queue = {entries = {tqh_first = 0x0, tqh_last = 0x0}}},
> >>   migration_blocker = 0x0}
> >> (gdb)
> >>
> >> you see, all parameters are zeroes.
> >>
> >> Technically this could happen in the middle of the qcow2_invalidate_cache
> >> which, in turn is called from migration code f.e. during resuming from
> >> suspend (which is how this thing is implemented in libvirt).
> >>
> >> Migration code is running actually at the moment as far as could be seen
> >> in coredump.
> >>
> >> Any thoughts? :)
> >>
> >> Den
> >>
> > The situation is even worse than I have feared.
> >
> > 1) the hypothesis is correct. We are in the middle of the
> > qcow2_invalidate_cache which is called from here
> >
> > static void process_incoming_migration_co(void *opaque)
> > {
> >     ...
> >     /* Make sure all file formats flush their mutable metadata */
> >     bdrv_invalidate_cache_all(&local_err);
> >
> > 2) Though the code is problematic, coroutine context is switched inside,
> > there are a lot of places doing IO inside.
> >
> > 3) Incoming request for qmp_query_block is called, it is safely performed
> > from monitor context
> >
> > diff --git a/block/qapi.c b/block/qapi.c
> > index 8a19aed..8a07d75 100644
> > --- a/block/qapi.c
> > +++ b/block/qapi.c
> > @@ -368,7 +368,12 @@ BlockInfoList *qmp_query_block(Error **errp)
> >
> >      for (blk = blk_next(NULL); blk; blk = blk_next(blk)) {
> >          BlockInfoList *info = g_malloc0(sizeof(*info));
> > +        AioContext *ctx = blk_get_aio_context(blk);
> > +
> > +        aio_context_acquire(ctx);
> >          bdrv_query_info(blk, &info->value, &local_err);
> > +        aio_context_release(ctx);
> > +
> >          if (local_err) {
> >              error_propagate(errp, local_err);
> >              goto err;
> >
> > will not help us as the thread is the same! We will increase
> > nesting in rfifolock_lock() and proceed inside to crash.
> >
> > 4) locking with coroutine lock in qcow2_get_specific_info
> > will not help too - it is initialized in qcow2_open inside
> > invalidate cache and is taken/dropped too several
> > time
> >
> > :(((
>
> Thanks for finding this!
>
> Well, if qcow2_get_specific_info() is the only place that can actually
> cause issues in that case (i.e. calling some QMP function which uses the
> qcow2 image while the incoming_migration coroutine yields before the
> image has been fully reopened), I think the simplest way to fix this
> would be to just return NULL from qcow2_get_specific_info() in the else
> branch (which currently aborts), adding a comment how we can end up there.
>
> However, it seems hard to believe this is the only problematic path...
> If the coroutine can yield between the BDRVQcow2State getting memset()
> to 0 and qcow2_open() having initialized it again, then any QMP command
> which makes use of the qcow2 image should fail (not necessarily
> gracefully) at that point.
>
> I feel like qcow2_invalidate_cache() is the root cause. Basically
> completely reopening the qcow2 image just feels wrong. Maybe we could at
> least atomically replace the BDRVQcow2State, i.e. call qcow2_open()
> first (but with a new BDRVQcow2State), then swap out the BDRVQcow2State
> in the BDS, and then call qcow2_close() on the old BDRVQcow2State. That
> doesn't feel right still, but it should fix this issue, at least.
Yes, there is room for improvement in qcow2_invalidate_cache() and we
should probably do something about it, but I believe the root cause is
that the monitor can run commands while we're in the middle of
completing a migration. I can imagine "interesting" problems outside
the block layer, too, with half-loaded device state and similar things.

Now Paolo's commit 82a4da7 explicitly moved the code to a coroutine in
order to make it non-blocking, so he must have put some thought into
whether this is safe or not. Maybe something else has changed since
then that altered the situation.

Paolo, any ideas?

Kevin