Hi Den, the long email about IDE stuff: On 11/30/2017 07:01 AM, Denis V. Lunev wrote: > On 11/29/2017 02:50 AM, John Snow wrote: >> >> On 11/28/2017 07:10 AM, Denis V. Lunev wrote: >>> There is the following crash reported from the field in QEMU 2.9: >>> bdrv_inc_in_flight (bs=bs@entry=0x0) >>> blk_aio_prwv >>> blk_aio_preadv >>> ide_buffered_readv >>> cd_read_sector >> Is ide_atapi_cmd_reply_end missing from the call stack here for some >> reason? ide_data_readw /should/ have end_transfer_func set to that if it >> was processing an ATAPI command; otherwise it shouldn't be able to get >> here under normal circumstances... > this is my fault. I have lost this line while removing unnecessary into > from callstack. Here if the full one. >
Well, at least we're back on terra firma here. I'm going to talk myself through what the problem looks like here, and I have some questions for you at the bottom of the email, so you can just skim if you're not interested in the details. > (gdb) bt > #0 bdrv_inc_in_flight (bs=bs@entry=0x0) at block/io.c:561 > #1 0x000055b224160d37 in blk_aio_prwv (blk=0x55b2265405a0, > offset=offset@entry=285122560, bytes=2048, > qiov=qiov@entry=0x55b22a31de20, > co_entry=co_entry@entry=0x55b224160400 <blk_aio_read_entry>, > flags=flags@entry=0, cb=cb@entry=0x55b224048e00 > <ide_buffered_readv_cb>, > opaque=opaque@entry=0x55b22a31de00) at block/block-backend.c:1151 > #2 0x000055b224160db5 in blk_aio_preadv (blk=<optimized out>, > offset=offset@entry=285122560, qiov=qiov@entry=0x55b22a31de20, > flags=flags@entry=0, cb=cb@entry=0x55b224048e00 > <ide_buffered_readv_cb>, > opaque=opaque@entry=0x55b22a31de00) at block/block-backend.c:1256 > #3 0x000055b22404bd8d in ide_buffered_readv (s=s@entry=0x55b22a712a68, > sector_num=556880, iov=iov@entry=0x55b22a712d60, > nb_sectors=nb_sectors@entry=4, > cb=cb@entry=0x55b22404f2a0 <cd_read_sector_cb>, > opaque=opaque@entry=0x55b22a712a68) at hw/ide/core.c:637 > #4 0x000055b22404e2c1 in cd_read_sector (s=0x55b22a712a68) > at hw/ide/atapi.c:198 > #5 ide_atapi_cmd_reply_end (s=0x55b22a712a68) at hw/ide/atapi.c:272 > #6 0x000055b224049704 in ide_data_readw (opaque=<optimized out>, > addr=<optimized out>) at hw/ide/core.c:2263 > #7 0x000055b223edd7f0 in portio_read (opaque=0x55b22a836000, addr=0, > size=2) > at /usr/src/debug/qemu-2.9.0/ioport.c:180 > #8 0x000055b223ee8e3b in memory_region_read_accessor (mr=0x55b22a836000, > addr=0, value=0x7f843b5be7c0, size=2, shift=0, mask=65535, attrs=...) > at /usr/src/debug/qemu-2.9.0/memory.c:435 > #9 0x000055b223ee6369 in access_with_adjusted_size (addr=addr@entry=0, > value=value@entry=0x7f843b5be7c0, size=size@entry=2, > access_size_min=<optimized out>, access_size_max=<optimized out>, > access=access@entry=0x55b223ee8e10 <memory_region_read_accessor>, > mr=mr@entry=0x55b22a836000, attrs=attrs@entry=...) > at /usr/src/debug/qemu-2.9.0/memory.c:592 > #10 0x000055b223ee9d36 in memory_region_dispatch_read1 (attrs=..., size=2, > pval=0x7f843b5be7c0, addr=0, mr=0x55b22a836000) > at /usr/src/debug/qemu-2.9.0/memory.c:1238 > #11 memory_region_dispatch_read (mr=mr@entry=0x55b22a836000, > addr=addr@entry=0, pval=pval@entry=0x7f843b5be7c0, size=size@entry=2, > attrs=attrs@entry=...) at /usr/src/debug/qemu-2.9.0/memory.c:1269 > #12 0x000055b223e9bdb2 in address_space_read_continue ( > as=as@entry=0x55b2247db8c0 <address_space_io>, addr=addr@entry=496, > ---Type <return> to continue, or q <return> to quit--- > attrs=..., attrs@entry=..., > buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>, > len=len@entry=2, addr1=0, l=2, mr=0x55b22a836000) > at /usr/src/debug/qemu-2.9.0/exec.c:2844 > #13 0x000055b223e9be67 in address_space_read_full ( > as=0x55b2247db8c0 <address_space_io>, addr=addr@entry=496, attrs=..., > buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>, > len=2, len@entry=602521191) at /usr/src/debug/qemu-2.9.0/exec.c:2895 > #14 0x000055b223e9bfce in address_space_read (len=602521191, > buf=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>, attrs=..., > addr=496, as=<optimized out>) > at /usr/src/debug/qemu-2.9.0/include/exec/memory.h:1718 > #15 address_space_rw (as=<optimized out>, addr=addr@entry=496, attrs=..., > attrs@entry=..., > buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>, > len=len@entry=2, is_write=is_write@entry=false) > at /usr/src/debug/qemu-2.9.0/exec.c:2909 > #16 0x000055b223ee5271 in kvm_handle_io (count=512, size=2, > direction=<optimized out>, data=<optimized out>, attrs=..., port=496) > at /usr/src/debug/qemu-2.9.0/kvm-all.c:1828 > #17 kvm_cpu_exec (cpu=cpu@entry=0x55b229032000) > at /usr/src/debug/qemu-2.9.0/kvm-all.c:2058 > #18 0x000055b223ed1c22 in qemu_kvm_cpu_thread_fn (arg=0x55b229032000) > at /usr/src/debug/qemu-2.9.0/cpus.c:1119 > #19 0x00007f8443993e25 in start_thread (arg=0x7f843b5bf700) > at pthread_create.c:308 > #20 0x00007f84436c134d in clone () > at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 > > > >>> ide_data_readw >> How do we get here? ide_is_pio_out ought to be false here; do you know >> what command was being processed? Do you have a reproducer? >> >> Knowing both the ATA and ATAPI commands under consideration here would >> be helpful. >> > > if you prefer, I can upload core/binary to some storage. > Here is the state. > This might be helpful for me to poke around at, but unfortunately it looks like I can't see the ATAPI command that was being processed from the trace given. I don't think there's a way for me to retrieve it from a core, either. > (gdb) p *s > $3 = {bus = 0x55b22a7129f0, unit = 0 '\000', drive_kind = IDE_CD, > cylinders = 0, heads = 0, sectors = 0, chs_trans = 0, nb_sectors = > 11588488, > mult_sectors = 16, identify_set = 1, > identify_data = "\300\205", '\000' <repeats 18 times>, "MQ0000 1", ' ' > <repeats 12 times>, "\003\000\000\002\004\000.2+5 EQUMD DVR-MO", ' ' > <repeats 28 times>, > "\000\000\001\000\000\003\000\000\000\000\000\000\a", '\000' <repeats 17 > times>, > "\a\000\a\000\003\000\264\000\264\000,\001\264\000\000\000\000\000\036\000\036", > '\000' <repeats 15 times>, "\036", '\000' <repeats 15 times>, "? ", > '\000' <repeats 333 times>, drive_serial = 1, > drive_serial_str = "QM00001", '\000' <repeats 13 times>, > drive_model_str = "QEMU DVD-ROM", '\000' <repeats 28 times>, wwn = 0, > feature = 0 '\000', error = 0 '\000', nsector = 2, sector = 0 '\000', > lcyl = 0 '\000', hcyl = 8 '\b', hob_feature = 0 '\000', > hob_nsector = 3 '\003', hob_sector = 0 '\000', hob_lcyl = 0 '\000', > hob_hcyl = 8 '\b', select = 32 ' ', status = 80 'P', lba48 = 0 '\000', status = BUSY_STAT nsector = ATAPI_INT_REASON_IO (set in ide_atapi_cmd_reply_end) byte_count_limit = 2048 (hcyl << 8) > blk = 0x55b2265405a0, version = "2.5+\000\000\000\000", events = { > eject_request = false, new_media = false}, sense_key = 0 '\000', > asc = 0 '\000', tray_open = false, tray_locked = false, > cdrom_changed = 0 '\000', packet_transfer_size = 55296, > elementary_transfer_size = 0, io_buffer_index = 2048, lba = 139220, packet_transfer_size = 55296, (27 cd sectors) elementary_transfer_size = 0, io_buffer_index = 2048, lba = 139220, or 0x021fd4 This is either an initial read request, or a rebuffering. In the normative case, initial reads look like this: IDE: ide_ioport_write ide_exec_cmd (0xA0) cmd_packet ide_transfer_start s->end_transfer_func = ide_atapi_cmd [data_ptr and data_end set to meaningful values] [Return to guest, await PIO write of scsi cdb] ide_data_writew ide_atapi_cmd [soft error if !blk->bs] cmd_read_cd ide_atapi_cmd_read ide_atapi_cmd_read_pio ide_atapi_cmd_reply_end cd_read_sector [segv if !blk->bs] AHCI: handle_reg_h2d_fis [scsi CDB is copied into s->io_buffer] ide_exec_cmd cmd_packet ide_transfer_start [data_ptr and data_end set to transfer scsi cdb] ahci_start_transfer [shortcut, no transfer] s->data_ptr set to s->data_end [*] ide_atapi_cmd * (but data_end likely is still != s->io_buffer) from here the callback is the same as above: we'll wind up at cd_read_sector. Neither of these cases match your traceback. If it was the initial buffering, it would have occurred synchronously with receipt of the SCSI CDB, but your traces place it as a consequence of an attempt to read data, so that makes me suspect it's a rebuffering attempt, and the data in s->io_buffer is... cd data? I guess? did you have a CD in here at one point? > cd_sector_size = 2048, atapi_dma = 0, acct = {bytes = 2048, So we're using PIO paths when applicable. The s->feature register being 0 seems to agree. > start_time_ns = 433417215764666, type = BLOCK_ACCT_READ}, pio_aiocb > = 0x0, > iov = {iov_base = 0x55b22a8ae000, iov_len = 2048}, qiov = { > iov = 0x55b22a712d50, niov = 1, nalloc = -1, size = 2048}, > buffered_requests = {lh_first = 0x0}, io_buffer_offset = 0, > io_buffer_size = 0, sg = {sg = 0x0, nsg = 0, nalloc = 0, size = 0, > dev = 0x0, as = 0x0}, req_nb_sectors = 0, > end_transfer_func = 0x55b22404e190 <ide_atapi_cmd_reply_end>, > data_ptr = 0x55b22a8ae800 "", data_end = 0x55b22a8ae800 "", > io_buffer = 0x55b22a8ae000 s->data_ptr = 0x55b22a8ae800 s->data_end = 0x55b22a8ae800 s->io_buffer = 0x55b22a8ae000 s->io_buffer_offset = 0 Uh, hm. This does not look like an in-progress PIO transfer of any kind... This says to me that ide_dummy_transfer_stop was called at some point and nothing has started up in its place. This seems to conflict with the trace, which looks like a rebuffering. This looks like... undefined behavior entirely. I am not sure I have a good guess as to what's happening. > ")8\"\t\031\"\232\vx.\311l\301\270o$\335\025\257\064F\271\rI,R\342\032\315\nԆV\341od\251\023>\024\370\323\060A\256\337\300텋\024\233\201U\221T^\202\303\036\"E%\262\230\367ξ\fW\302\360\277\347\334\022\253\a\025\216\rj\334z\355>)\230/\021\255%a^\306\001\032", ... hmm 0x29 56 34 09 19 0x29 is... almost a read command (0x28) but it isn't. 0x29 shouldn't map to anything, actually. So this must be data we've buffered, or something left over from god knows what. > > io_buffer_total_len = 131076, cur_io_buffer_offset = 0, > cur_io_buffer_len = 0, end_transfer_fn_idx = 0 '\000', > sector_write_timer = 0x55b22a45d5c0, irq_count = 0, ext_error = 0 '\000', > mdata_size = 0, mdata_storage = 0x0, media_changed = 0, > dma_cmd = IDE_DMA_READ, smart_enabled = 1 '\001', smart_autosave = 1 > '\001', > smart_errors = 0, smart_selftest_count = 0 '\000', > ---Type <return> to continue, or q <return> to quit--- > smart_selftest_data = 0x55b22a827000 "", ncq_queues = 0} > (gdb) > > In general, I can upload the binary and core to the location > you want. Send it to me in person. By the way. I am staring to > receive these crashes in number. 2-3 in a day BUT from one > specific host. This could be important. > >>> portio_read >>> memory_region_read_accessor >>> access_with_adjusted_size >>> memory_region_dispatch_read1 >>> memory_region_dispatch_read >>> address_space_read_continue >>> address_space_read_full >>> address_space_read >>> address_space_rw >>> kvm_handle_io >>> kvm_cpu_exec >>> qemu_kvm_cpu_thread_fn >>> start_thread >>> clone >> The thing that looks to be incongruent here is that we appear to be >> servicing a ATAPI reply; that reply is either the kind that uses >> preformatted data in a buffer, or the kind that buffers a read. >> >> If it buffers a read, it should require CHECK_READY which requires a medium. >> >> If it does not buffer a read, it should not be able to invoke >> cd_read_sector or any bdrv function from ide_data_readw. >> >> If it neither serves preformatted data nor buffers a read, it should not >> allow ide_data_readw to perform any action at all. > the guest could misbehave! This is not a reason to crash. > For example there is a race or something like this. What > if the guest will execute read from the device without > ANY sanity checks? QEMU should not crash. This is the point > and this is the approach taken in the original patch. > The stuff quoted above isn't participatory, it's mandatory. I don't know how we're bypassing CHECK_READY. I want to understand what's happening before I pepper in sanity checks. in 2.9.0, you have this code: if ((cmd->flags & CHECK_READY) && (!media_present(s) || !blk_is_inserted(s->blk))) { ide_atapi_cmd_error(s, NOT_READY, ASC_MEDIUM_NOT_PRESENT); return; } with media_present defined as: static inline int media_present(IDEState *s) { return !s->tray_open && s->nb_sectors > 0; } ... however, from your gdb output, we do have: tray_open = false nb_sectors = 11588488 but blk_is_inserted is still going to check for the presence of 'bs' attached to the Block Backend, so this check should still prevent us from executing a read command out of the gate. Is the guest executing a read and then are we racing with the removal of the bds? Do you expect a CD to be inserted on this guest? nb_sectors I would expect to be 0; based on ide_cd_change_cb: blk_get_geometry(s->blk, &nb_sectors); this should set nb_sectors to 0 if there's no blk->bs. I'm sort of lost, I can't really easily construct what exactly happened here. > >>> Indeed, the CDROM device without media has blk->bs == NULL. We should >>> check that the media is really available for the device like has been done >>> in SCSI code. >>> >>> May be the patch adds a bit more check than necessary, but this is not be >>> the problem. We should always stay on the safe side. >>> >>> Signed-off-by: Denis V. Lunev <d...@openvz.org> >>> CC: John Snow <js...@redhat.com> >>> CC: Kevin Wolf <kw...@redhat.com> >>> CC: Stefan Hajnoczi <stefa...@redhat.com> >>> --- >>> hw/ide/atapi.c | 32 ++++++++++++++++++++++++++++---- >>> hw/ide/core.c | 4 ++-- >>> 2 files changed, 30 insertions(+), 6 deletions(-) >>> >>> diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c >>> index c0509c8bf5..fa50c0ccf6 100644 >>> --- a/hw/ide/atapi.c >>> +++ b/hw/ide/atapi.c >>> @@ -119,6 +119,11 @@ cd_read_sector_sync(IDEState *s) >>> >>> trace_cd_read_sector_sync(s->lba); >>> >>> + if (!blk_is_available(s->blk)) { >>> + ret = -ENOMEDIUM; >>> + goto fail; >>> + } >>> + >>> switch (s->cd_sector_size) { >>> case 2048: >>> ret = blk_pread(s->blk, (int64_t)s->lba << ATAPI_SECTOR_BITS, >>> @@ -132,8 +137,8 @@ cd_read_sector_sync(IDEState *s) >>> } >>> break; >>> default: >>> - block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ); >>> - return -EIO; >>> + ret = -EIO; >>> + goto fail; >>> } >>> >>> if (ret < 0) { >>> @@ -145,6 +150,10 @@ cd_read_sector_sync(IDEState *s) >>> } >>> >>> return ret; >>> + >>> +fail: >>> + block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ); >>> + return ret; >>> } >>> >>> static void cd_read_sector_cb(void *opaque, int ret) >>> @@ -174,9 +183,15 @@ static void cd_read_sector_cb(void *opaque, int ret) >>> >>> static int cd_read_sector(IDEState *s) >>> { >>> + int err; >>> + >>> if (s->cd_sector_size != 2048 && s->cd_sector_size != 2352) { >>> - block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ); >>> - return -EINVAL; >>> + err = -EINVAL; >>> + goto fail; >>> + } >>> + if (!blk_is_available(s->blk)) { >>> + err = -ENOMEDIUM; >>> + goto fail; >>> } >>> >>> s->iov.iov_base = (s->cd_sector_size == 2352) ? >>> @@ -195,6 +210,10 @@ static int cd_read_sector(IDEState *s) >>> >>> s->status |= BUSY_STAT; >>> return 0; >>> + >>> +fail: >>> + block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ); >>> + return err; >>> } >>> >>> void ide_atapi_cmd_ok(IDEState *s) >>> @@ -404,6 +423,11 @@ static void ide_atapi_cmd_read_dma_cb(void *opaque, >>> int ret) >>> goto eot; >>> } >>> >>> + if (!blk_is_available(s->blk)) { >>> + ide_atapi_cmd_read_dma_cb(s, -ENOMEDIUM); >>> + return; >>> + } >>> + >> I'm not sure this is OK, because it's an error but not setting the sense >> code or raising an IRQ; it's only calling ide_set_inactive, so this >> might be a problem. Normally the error code from the data read is >> handled earlier in the call by ide_handle_rw_error which can set the >> proper codes. >> > this follows the approach used by the error handling in this call. > I this code is not made worse. We can hit to the same processing > with the different error upper. > >>> s->io_buffer_index = 0; >>> if (s->cd_sector_size == 2352) { >>> n = 1; >>> diff --git a/hw/ide/core.c b/hw/ide/core.c >>> index 471d0c928b..71780fc9d1 100644 >>> --- a/hw/ide/core.c >>> +++ b/hw/ide/core.c >>> @@ -758,7 +758,7 @@ static void ide_sector_read(IDEState *s) >>> >>> trace_ide_sector_read(sector_num, n); >>> >>> - if (!ide_sect_range_ok(s, sector_num, n)) { >>> + if (!ide_sect_range_ok(s, sector_num, n) || !blk_is_available(s->blk)) >>> { >>> ide_rw_error(s); >>> block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ); >>> return; >>> @@ -1023,7 +1023,7 @@ static void ide_sector_write(IDEState *s) >>> >>> trace_ide_sector_write(sector_num, n); >>> >>> - if (!ide_sect_range_ok(s, sector_num, n)) { >>> + if (!ide_sect_range_ok(s, sector_num, n) || !blk_is_available(s->blk)) >>> { >>> ide_rw_error(s); >>> block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_WRITE); >>> return; >>> >> Since it's not a new regression (this is being reported against 2.9) I >> am somewhat hesitant to rush it into 2.11-rc3 without a little more info. >> >> That said, here's a list of the ATAPI commands we service that either >> return or CAN return data, but do not enforce CHECK_READY: >> >> [ 0x03 ] = { cmd_request_sense, ALLOW_UA }, >> [ 0x12 ] = { cmd_inquiry, ALLOW_UA }, >> [ 0x46 ] = { cmd_get_configuration, ALLOW_UA }, >> [ 0x4a ] = { cmd_get_event_status_notification, ALLOW_UA }, >> [ 0x5a ] = { cmd_mode_sense, /* (10) */ 0 }, >> [ 0xbd ] = { cmd_mechanism_status, 0 }, >> >> These six all invoke ide_atapi_cmd_reply in some way (which allows the >> guest to begin reading the reply via PIO mechanisms if DMA is not set): >> >> cmd_request_sense: ide_atapi_cmd_reply(s, 18, max_len); >> cmd_inquiry: ide_atapi_cmd_reply(s, idx, max_len); >> cmd_get_configuration: ide_atapi_cmd_reply(s, len, max_len); >> cmd_get_event_status_notification: >> ide_atapi_cmd_reply(s, used_len, max_len); >> cmd_mode_sense: ide_atapi_cmd_reply(s, 16, max_len); >> ide_atapi_cmd_reply(s, 24, max_len); >> ide_atapi_cmd_reply(s, 30, max_len); >> cmd_mechanism_status: ide_atapi_cmd_reply(s, 8, max_len); >> >> ide_atapi_cmd_reply itself sets lba to -1, which should inform >> ide_atapi_cmd_reply_end not to attempt to buffer any new data. These >> *should* be safe. The reply sizes are also all small enough that they >> are almost certainly getting buffered in one single transfer without >> attempting to buffer more data. >> >> In the normative PIO case then, reads will consume from this buffer >> until empty, then we'll call ide_atapi_cmd_reply_end through >> end_transfer_func and hit the end of transfer logic. >> >> I'm not sure I see how this crash is happening; it doesn't look like >> this path allows for invoking the block read functions and everything >> else is guarded by NONDATA or CHECK_READY. >> >> Unless this is an interesting interaction with ide_atapi_cmd_reply >> setting the DRQ bit which may trick the pio read function to allow PIO >> reads to come in during this time? >> >> Hypothetically: >> >> cmd_packet sets end_transfer_func to ide_atapi_cmd so it can process >> further once that PIO is completed. (The PIO may be emulated >> synchronously, in the case of AHCI.) In the true PIO case, it may be >> asynchronous. >> >> Now, in the AHCI case, the guest has control back and the CDROM is now >> executing a command to, perhaps, read data via DMA. Then, >> simultaneously, the guest issues a PIO read and because the DRQ bit is >> set for DMA, the PIO read also goes through. >> >> This still requires a media, though... and a very broken guest doing >> something naughty on purpose. >> >> I can't quite connect the dots to see how this crash is possible in >> general... it'd help to have: >> >> - The ATAPI command being processed at the time >> - The IDE command being processed at the time >> - s->status >> >> as a minimum, and then perhaps optionally some of the ATAPI loop >> variables, like packet_transfer_size, elementary_transfer_size, and >> io_buffer_index. Or a reproducer! >> >> Sorry I'm not more help and I wrote too much again :( > imho this does not matter. Once again - the guest can misbehave. I never said that if a guest acting out of spec was the problem I wouldn't fix it. I just want to find the root cause. If this has been broken since 2.9, 2.11-rc3 is too late for a bandaid applied to something I can't diagnose. Let's discuss this for 2.12 and I will keep trying to figure out what the root cause is. Some questions for you: (1) Is the guest Linux? Do we know why this one machine might be tripping up QEMU? (Is it running a fuzzer, a weird OS, etc...?) (2) Does the VM actually have a CDROM inserted at some point? Is it possible we're racing on some kind of eject or graph manipulation failure? (3) Is this using AHCI or IDE? If I can't figure it out within a week or so from here I'll just check in the band-aid with some /* FIXME */ comments attached. --js