I believe your patch set provides these behaviors now:

 * QEMU block drivers report discard_granularity.
   * discard_granularity = 0 means no discard.
     * The guest is told there's no discard support.
   * discard_granularity < 0 is undefined.
   * discard_granularity > 0 is reported to the guest as discard
     support.
 * QEMU block drivers report discard_zeros_data. This is passed to the
   guest when discard_granularity > 0.

I propose adding the following behaviors in any event:

 * If a QEMU block device reports a discard_granularity > 0, it must be
   equal to 2^n (n >= 0), or QEMU's block core will change it to 0.
   (Non-power-of-two granularities are not likely to exist in the real
   world, and this assumption greatly simplifies ensuring correctness.)
 * For SCSI, report an unmap_granularity to the guest as follows:
     max(logical_block_size, discard_granularity) / logical_block_size

Regarding emulating discard_zeros_data...

I agree that when discard_zeros_data is set, we will need to write
zeroes in some cases. As you noted, IDE has a fixed granularity of one
sector. And the SCSI granularity is a hint only; guests are not
guaranteed to align to that value either. [0]

As a design concept, instead of guaranteeing that 512B zeroing discards
are supported, I think the QEMU block layer should instead guarantee
aligned discards to QEMU block devices, emulating any misaligned
discards (or portions thereof) by writing zeroes if (and only if)
discard_zeros_data is set.

When the QEMU block layer gets a discard:

 * Of the specified discard range, see if it includes an aligned
   multiple of discard granularity. If so, save that as the starting
   point of a subrange. Then find the last aligned multiple, if any,
   and pass that subrange (if start != end) down to the block driver's
   discard function.
   * If the discard really fails (i.e. returns failure and sets errno
     to something other than "not supported" or equivalent), return
     failure to the guest. For "not supported", fall through to the
     code below with the full range.
 * At this point, we have zero, one, or two subranges to handle.
 * If and only if discard_zeros_data is set, write zeros to the
   remaining subranges, if any. (This would use a lower-level
   write_zeroes call which does not attempt to use discard.) If this
   fails, return failure to the guest.
 * Return success.

This leaves one remaining issue: In raw-posix.c, for files (i.e. not
devices), I assume you're going to advertise discard_granularity=1 and
discard_zeros_data=1 when compiled with support for
fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually
guarantees that it zeros the data when punching holes. I haven't
verified this.

If the guest does a big discard (think mkfs) and fallocate() returns
EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
which, as you noted, will also allocate it (unless you explicitly check
for holes). This is bad. It can be avoided by not advertising
discard_zeros_data, but as you noted, that's unfortunate.

If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
going to work. This would sidestep these problems. You said it wasn't
possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
by extending the file by one byte and then punching that:

    char buf = 0;
    fstat(s->fd, &st);
    /* Temporarily grow the file by writing a byte past its current end. */
    pwrite(s->fd, &buf, 1, st.st_size + 1);
    /* Try to punch out that byte; success means punch-hole works here. */
    has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             st.st_size + 1, 1);
    /* Restore the original file size. */
    ftruncate(s->fd, st.st_size);

[0] See the last paragraph starting on page 8:
    http://mkp.net/pubs/linux-advanced-storage.pdf

--
Richard
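
P.S. To make the granularity rules and the splitting step above more concrete, here is a rough sketch in C. Every name in it (sanitize_granularity, unmap_granularity, struct discard_split, split_discard) is made up for illustration and is not QEMU's actual API. It only computes the ranges: the misaligned head and tail would be zeroed if (and only if) discard_zeros_data is set, and the aligned middle would go to the driver's discard function. It does not model the "not supported" fall-through, and it assumes offset + len does not overflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical; none of this is QEMU's actual API. */

static bool is_power_of_two(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Proposed rule: a granularity that is not 2^n (n >= 0) becomes 0,
 * i.e. "no discard support". */
static uint64_t sanitize_granularity(int64_t g)
{
    return (g > 0 && is_power_of_two((uint64_t)g)) ? (uint64_t)g : 0;
}

/* SCSI unmap granularity in logical blocks:
 * max(logical_block_size, discard_granularity) / logical_block_size */
static uint64_t unmap_granularity(uint64_t logical_block_size,
                                  uint64_t granularity)
{
    uint64_t m = granularity > logical_block_size ? granularity
                                                  : logical_block_size;
    return m / logical_block_size;
}

struct discard_split {
    uint64_t head_start, head_len; /* misaligned head (zero it, maybe) */
    uint64_t mid_start, mid_len;   /* aligned part (pass to the driver) */
    uint64_t tail_start, tail_len; /* misaligned tail (zero it, maybe) */
};

/* Split [offset, offset + len) around the aligned subrange, if any.
 * granularity must be 0 or a power of two (see sanitize_granularity). */
static struct discard_split split_discard(uint64_t offset, uint64_t len,
                                          uint64_t granularity)
{
    struct discard_split s = {0};
    uint64_t end = offset + len;

    if (granularity == 0) {
        /* No discard support: the whole range is "remaining". */
        s.head_start = offset;
        s.head_len = len;
        return s;
    }

    uint64_t mask = granularity - 1;
    uint64_t first = (offset + mask) & ~mask; /* round start up */
    uint64_t last = end & ~mask;              /* round end down */

    if (first >= last) {
        /* No aligned subrange fits: the whole range is "remaining". */
        s.head_start = offset;
        s.head_len = len;
        return s;
    }

    s.head_start = offset;
    s.head_len = first - offset;
    s.mid_start = first;
    s.mid_len = last - first;
    s.tail_start = last;
    s.tail_len = end - last;
    return s;
}
```

For example, with a granularity of 4096, a discard of bytes [1000, 10000) would split into a head of [1000, 4096), an aligned middle of [4096, 8192), and a tail of [8192, 10000). With a 512-byte logical block size, that same granularity would be reported as an unmap granularity of 8 blocks.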