On 12/09/2014 01:15 PM, Dr. David Alan Gilbert (git) wrote:
From: "Dr. David Alan Gilbert" <dgilb...@redhat.com>

(With the previous atapi_dma flag recovery)
If migration happens between the ATAPI command being written and the
bmdma being started, the DMA is dropped. Eventually the guest times
out and recovers, but that can take many seconds.
(This is rare, on a pingpong reading the CD continuously I hit
this about ~1/30-1/50 migrates)

I don't think we've got enough state to be able to recover safely
at this point, so I throw a 'medium error, no seek complete'
that I'm assuming guests will try and recover from an apparently
dirty CD.

OK, it's a hack, the real solution is probably to push a lot of
ATAPI state into the migration stream, but this is a fix that
works with no stream changes. Tested only on Linux (both RHEL5
(pre-libata) and RHEL7).

Signed-off-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
---
 hw/ide/atapi.c    | 17 +++++++++++++++++
 hw/ide/internal.h |  2 ++
 hw/ide/pci.c      | 11 +++++++++++
 3 files changed, 30 insertions(+)

diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
index c63b7e5..e17799c 100644
--- a/hw/ide/atapi.c
+++ b/hw/ide/atapi.c
@@ -394,6 +394,23 @@ static void ide_atapi_cmd_read(IDEState *s, int lba, int nb_sectors,
     }
 }

+
+/* Called by *_restart_bh when the transfer function points
+ * to ide_atapi_cmd
+ */
+void ide_atapi_dma_restart(IDEState *s)
+{
+    /*
+     * I'm not sure we have enough stored to restart the command
+     * safely, so give the guest an error it should recover from.
+     * I'm assuming most guests will try to recover from something
+     * listed as a medium error on a CD; it seems to work on Linux.
+     * This would be more of a problem if we did any other type of
+     * DMA operation.
+     */
+    ide_atapi_cmd_error(s, MEDIUM_ERROR, ASC_NO_SEEK_COMPLETE);
+}
+
Is this safe for non-data commands? Can we even get there in such a case?
 static inline uint8_t ide_atapi_set_profile(uint8_t *buf, uint8_t *index,
                                             uint16_t profile)
 {
diff --git a/hw/ide/internal.h b/hw/ide/internal.h
index 8a3eca4..8b65285 100644
--- a/hw/ide/internal.h
+++ b/hw/ide/internal.h
@@ -289,6 +289,7 @@ typedef struct IDEDMAOps IDEDMAOps;
 #define ATAPI_INT_REASON_TAG            0xf8

 /* same constants as bochs */
+#define ASC_NO_SEEK_COMPLETE                 0x02
 #define ASC_ILLEGAL_OPCODE                   0x20
 #define ASC_LOGICAL_BLOCK_OOR                0x21
 #define ASC_INV_FIELD_IN_CMD_PACKET          0x24
@@ -529,6 +530,7 @@ void ide_dma_error(IDEState *s);

 void ide_atapi_cmd_ok(IDEState *s);
 void ide_atapi_cmd_error(IDEState *s, int sense_key, int asc);
+void ide_atapi_dma_restart(IDEState *s);
 void ide_atapi_io_error(IDEState *s, int ret);

 void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val);
diff --git a/hw/ide/pci.c b/hw/ide/pci.c
index bee5ad3..e3f2054 100644
--- a/hw/ide/pci.c
+++ b/hw/ide/pci.c
@@ -235,6 +235,17 @@ static void bmdma_restart_bh(void *opaque)
         }
     } else if (error_status & IDE_RETRY_FLUSH) {
         ide_flush_cache(bmdma_active_if(bm));
+    } else {
+        IDEState *s = bmdma_active_if(bm);
+
+        /*
+         * We've not got any bits to tell us about ATAPI - but
+         * we do have the end_transfer_func that tells us what
+         * we're trying to do.
+         */
+        if (s->end_transfer_func == ide_atapi_cmd) {
+            ide_atapi_dma_restart(s);
+        }
OK, so when the restart routines get invoked we add a hook to see if we were in the middle of an ATAPI command and acknowledge that we don't know how to properly handle this.
Isn't this going to run on every vmstate change, though? I think we don't clear out end_transfer_func on success, so this might fire off more than we want it to, although I guess end_transfer_func is usually going to get set to ide_atapi_cmd_reply_end if it finishes normally ...
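To make the concern concrete, here is a minimal standalone C model of the new branch. It is not QEMU code: MockIDEState, mock_restart_bh and the other mock_* names are hypothetical stand-ins, and the model assumes that raising the error leaves end_transfer_func untouched, which is exactly the point I'm unsure about in the real code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMU's IDEState; only what this sketch needs. */
typedef struct MockIDEState MockIDEState;
typedef void MockEndTransferFunc(MockIDEState *s);

struct MockIDEState {
    MockEndTransferFunc *end_transfer_func;
    int errors_raised;   /* number of fake medium errors given to the guest */
};

/* Stand-ins for ide_atapi_cmd and ide_atapi_cmd_reply_end. */
static void mock_atapi_cmd(MockIDEState *s) { (void)s; }
static void mock_atapi_cmd_reply_end(MockIDEState *s) { (void)s; }

/*
 * Models ide_atapi_dma_restart(): report an error instead of resuming.
 * Assumption: the error path does not reset end_transfer_func.
 */
static void mock_atapi_dma_restart(MockIDEState *s)
{
    s->errors_raised++;
}

/* Models the new else-branch added to bmdma_restart_bh(). */
static void mock_restart_bh(MockIDEState *s)
{
    assert(s != NULL);
    if (s->end_transfer_func == mock_atapi_cmd) {
        mock_atapi_dma_restart(s);
    }
}
```

In this model, calling mock_restart_bh() twice on a state whose end_transfer_func is still mock_atapi_cmd raises the error twice, while a state that finished normally (pointing at mock_atapi_cmd_reply_end) never trips the hook. Whether the real device can sit in the first state across several vmstate changes is the open question.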
     }
 }
Indeed a hack, but it's probably appropriate: if our code cannot in fact handle ATAPI migration, throwing an error or disabling migration is the correct thing to do, but I don't think users would be very happy with the second option. I feel that this is an OK workaround because it should not introduce spurious errors or retries for cases where we manage to avoid migrating in the middle of the loop. This will at least let the currently broken case limp along until we fix it more properly.
What makes me the most curious is how this plays out in Windows if this case is triggered. Throw a trace around the fake error and see if you can't observe it getting called during a pingpong test while Windows reads a CD.
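Something along these lines is what I have in mind for the trace, sketched as standalone C rather than a real patch. In QEMU this would more likely be a proper trace event than an fprintf; every mock_* name and the fake_error_count counter are hypothetical, and the MOCK_MEDIUM_ERROR value 0x03 is the standard SCSI sense key, assumed to match QEMU's MEDIUM_ERROR.

```c
#include <stdio.h>

/* Assumed to mirror MEDIUM_ERROR and the patch's ASC_NO_SEEK_COMPLETE. */
#define MOCK_MEDIUM_ERROR          0x03
#define MOCK_ASC_NO_SEEK_COMPLETE  0x02

/* Hypothetical stand-in holding only the sense data the guest would see. */
typedef struct {
    int sense_key;
    int asc;
} MockATAPIState;

/* How many times the fake error has been delivered so far. */
static unsigned long fake_error_count;

/* Stand-in for ide_atapi_cmd_error(): just records the sense data. */
static void mock_atapi_cmd_error(MockATAPIState *s, int sense_key, int asc)
{
    s->sense_key = sense_key;
    s->asc = asc;
}

/* The restart hook with a trace line wrapped around the fake error. */
static void mock_atapi_dma_restart_traced(MockATAPIState *s)
{
    fake_error_count++;
    fprintf(stderr, "atapi_dma_restart: faking medium error (hit %lu)\n",
            fake_error_count);
    mock_atapi_cmd_error(s, MOCK_MEDIUM_ERROR, MOCK_ASC_NO_SEEK_COMPLETE);
}
```

Counting the hits during a pingpong run would show both whether Windows ever reaches this path and whether it fires more than once per migration.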