> Let me elaborate: Yes, a cluster size generally means that it is most
> “efficient” to access the storage at that size. But there’s a tradeoff.
> At some point, reading the data takes sufficiently long that reading a
> bit of metadata doesn’t matter anymore (usually, that is).

Any network storage suffers from long network latencies, so it always
matters if you do more IOs than necessary.
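To put rough numbers on that, here is a back-of-the-envelope model in
Python (the 1 ms round trip and 10 Gbit/s figures are assumptions for
illustration, not measurements):

# Time to read one 4 MiB chunk, either as a single request or split
# into smaller requests.  RTT and bandwidth below are assumed values.

RTT = 1e-3              # assumed per-request round-trip time: 1 ms
BANDWIDTH = 1.25e9      # assumed ~10 Gbit/s link: 1.25 GB/s
CHUNK = 4 * 1024**2     # 4 MiB to read in total

for req_size in (4 * 1024, 64 * 1024, CHUNK):
    n_requests = CHUNK // req_size
    total = n_requests * RTT + CHUNK / BANDWIDTH
    print(f"{req_size // 1024:5d} KiB requests: {total * 1e3:8.1f} ms")

With those figures, the 4 KiB case takes roughly a second and the single
4 MiB request a few milliseconds; the per-request round trips dominate
everything below roughly the cluster size.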
> There is a bit of a problem with making the backup copy size rather
> large, and that is the fact that backup’s copy-before-write causes guest
> writes to stall. So if the guest just writes a bit of data, a 4 MB
> buffer size may mean that in the background it will have to wait for 4
> MB of data to be copied.[1]

We have been using this for several years now in production, and it has
not been a problem. (The Ceph storage is mostly on 10G (or faster)
network equipment.)

> Hm. OTOH, we have the same problem already with the target’s cluster
> size, which can of course be 4 MB as well. But I can imagine it to
> actually be important for the target, because otherwise there might be
> read-modify-write cycles.
>
> But for the source, I still don’t quite understand why rbd has such a
> problem with small read requests. I don’t doubt that it has (as you
> explained), but again, how is it then even possible to use rbd as the
> backend for a guest that has no idea of this requirement? Does Linux
> really prefill the page cache with 4 MB of data for each read?

No idea. I just observed that upstream qemu backups with ceph are quite
unusable this way.
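For reference, one could measure the small-read behaviour on a given
cluster with the python-rbd bindings, along these lines (pool and image
names are placeholders; librbd caching and readahead will skew the
numbers, so treat it as a sketch, not a benchmark):

import time

import rados
import rbd

POOL = 'rbd'           # placeholder pool name
IMAGE = 'testimage'    # placeholder image, must be at least TOTAL bytes
TOTAL = 64 * 1024**2   # read 64 MiB at each request size

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx(POOL)
image = rbd.Image(ioctx, IMAGE, read_only=True)
try:
    for req_size in (4 * 1024, 64 * 1024, 4 * 1024**2):
        start = time.monotonic()
        for off in range(0, TOTAL, req_size):
            image.read(off, req_size)
        elapsed = time.monotonic() - start
        print(f"{req_size // 1024:5d} KiB requests: "
              f"{TOTAL / 1024**2 / elapsed:7.1f} MiB/s")
finally:
    image.close()
    ioctx.close()
    cluster.shutdown()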