On 25/02/2016 11:10, Alex Pyrgiotis wrote:
> All normal regions in a QEMUSGList point to an address range in the
> guest's RAM. The MMIO regions of QEMU's virtual devices, however, do
> not correspond to such an address range, so QEMU must create a bounce
> buffer to represent them. This bounce buffer is added in the I/O
> vector which contains the rest of the mapped addresses and is later
> given to a readv()/writev() call.

Correct.
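For reference, the mapping loop in dma-helpers.c does roughly the
following (a simplified sketch, not the code verbatim; field names
approximate the real code, and partial-entry offsets and the
resumption machinery are left out):

/* Simplified sketch of the mapping loop in dma-helpers.c. */
static void dma_map_sg(DMAAIOCB *dbs)
{
    while (dbs->sg_cur_index < dbs->sg->nsg) {
        ScatterGatherEntry *entry = &dbs->sg->sg[dbs->sg_cur_index];
        dma_addr_t len = entry->len;
        /* For RAM this returns a host pointer; for an MMIO region it
         * returns the single global bounce buffer, or NULL if that
         * buffer is already in use. */
        void *mem = dma_memory_map(dbs->sg->as, entry->base, &len,
                                   dbs->dir);
        if (!mem) {
            break;  /* stop here; a partial readv()/writev() follows */
        }
        qemu_iovec_add(&dbs->iov, mem, len);
        dbs->sg_cur_index++;
    }
    /* The I/O vector built so far is submitted with readv()/writev();
     * the loop resumes once the partial I/O has completed and the
     * bounce buffer is free again. */
}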
>>> 9. This leads to a partial read/write and the mapping loop will
>>> resume once the partial read/write() has finished.
>
> The MMIO region is the trigger for a partial read/write, but it's not
> the actual reason. The actual reason is that there is only *one*
> *global* bounce buffer. This means that if it's in use or we need to
> use it twice, we will have to wait.

Yes.

>>>> However, it is not possible to do the same for ioctls. This is
>>>> actually the reason why no one has ever tried to make scsi-generic
>>>> do anything but bounce-buffering. I think that your code breaks
>>>> horribly in this case, and I don't see a way to fix it, except for
>>>> reverting to bounce buffering.
>
> Sure, you're right, there's no sensible way to break an ioctl()
> operation into many. However, I'd argue that we shouldn't need to, as
> it would be much better if the DMA operation was never restarted in
> the first place. Instead, if we dealt with the bigger issue of the
> global bounce buffer, we could kill two birds with one stone.
>
> I see that there was an attempt [1] to replace the global bounce
> buffer with something more dynamic. In short, the objections [2] were
> the following:
>
> 1. It introduced locking/unlocking a global mutex in the hot path as
>    well as a hash table lookup.
> 2. It allowed for unbounded memory allocations.
>
> An improvement that would address (1) is to get rid of any global
> state:
>
> Since the mapping operation takes place in the context of a DMA
> operation, we could provide a ctx-type struct to the
> dma_memory_(un)map --> address_space_(un)map functions that would
> contain a hash table. If any memory allocations were needed, we would
> track them using that hash table, which would require no locks.
> Moreover, if the initialization of the hash table hurts the
> performance in the general case, we could instead use a skip list, if
> the number of memory allocations is small (e.g. < 100).

You don't need a hash table either if you manage the bounce buffer
list per DMA transfer, and the simplest way to achieve that is to move
the bounce buffer from exec.c to dma-helpers.c entirely.

The patch could first introduce address_space_map_direct, which never
uses the bounce buffer. dma-helpers.c can call address_space_map_direct
and, if it fails, proceed to allocate (and fill, if writing to the
device) a bounce buffer. Since the QEMUSGList is mapped and unmapped
beginning-to-end, you can just use a FIFO queue. The FIFO queue stores
a (QEMUSGList, buffer) tuple. When unmapping a QEMUSGList, you check
whether it matches the head of the queue; if it does, you write back
the contents of the bounce buffer (for reads from the device) and free
it. If it doesn't match, you call address_space_unmap. (A rough sketch
of this follows at the end of this message.)

Then, once the bounce buffer is implemented within dma-helpers.c, you
remove address_space_map and rename address_space_map_direct to
address_space_map. cpu_register_map_client goes away.

The unbounded memory allocation can be avoided by bounding the number
of entries in the queue. In the case of scsi-generic you could just as
well allow INT_MAX entries, because scsi-generic would do unbounded
memory allocation anyway for the bounce buffer.

Modulo the "& BDRV_SECTOR_MASK" issue, this actually seems simpler
than what this series was doing.
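Here is a minimal sketch of what I mean, assuming the proposed
address_space_map_direct exists; BounceEntry, bounce_queue and the
dma_map_one/dma_unmap_one helpers are made-up names, and bounding of
the queue and locking across AioContexts are left out:

#include "qemu/osdep.h"
#include "qemu/queue.h"
#include "sysemu/dma.h"

/* Lives in dma-helpers.c: a FIFO of bounce buffers, tagged with the
 * QEMUSGList (i.e. the DMA transfer) they belong to. */
typedef struct BounceEntry {
    QEMUSGList *sg;          /* transfer this buffer belongs to */
    dma_addr_t addr;         /* guest address the buffer stands in for */
    dma_addr_t len;
    uint8_t *buffer;
    QSIMPLEQ_ENTRY(BounceEntry) next;
} BounceEntry;

static QSIMPLEQ_HEAD(, BounceEntry) bounce_queue =
    QSIMPLEQ_HEAD_INITIALIZER(bounce_queue);

static void *dma_map_one(QEMUSGList *sg, dma_addr_t addr,
                         dma_addr_t *len, DMADirection dir)
{
    /* Try the direct map first; it fails only for MMIO regions. */
    void *mem = address_space_map_direct(sg->as, addr, len, dir);
    BounceEntry *e;

    if (mem) {
        return mem;
    }

    /* Fall back to a bounce buffer.  A real version would bound the
     * number of entries (scsi-generic could allow INT_MAX). */
    e = g_new(BounceEntry, 1);
    e->sg = sg;
    e->addr = addr;
    e->len = *len;
    e->buffer = g_malloc(*len);
    if (dir == DMA_DIRECTION_TO_DEVICE) {
        /* Writing to the device: fill the buffer from guest memory. */
        dma_memory_read(sg->as, addr, e->buffer, *len);
    }
    QSIMPLEQ_INSERT_TAIL(&bounce_queue, e, next);
    return e->buffer;
}

static void dma_unmap_one(QEMUSGList *sg, void *mem, dma_addr_t len,
                          DMADirection dir)
{
    /* Map and unmap both run beginning-to-end, so a bounce buffer for
     * this transfer can only sit at the head of the queue. */
    BounceEntry *e = QSIMPLEQ_FIRST(&bounce_queue);

    if (e && e->sg == sg && e->buffer == mem) {
        if (dir == DMA_DIRECTION_FROM_DEVICE) {
            /* Reading from the device: copy the result back. */
            dma_memory_write(sg->as, e->addr, e->buffer, e->len);
        }
        QSIMPLEQ_REMOVE_HEAD(&bounce_queue, next);
        g_free(e->buffer);
        g_free(e);
    } else {
        address_space_unmap(sg->as, mem, len,
                            dir == DMA_DIRECTION_FROM_DEVICE, len);
    }
}

Paolo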