On 18/12/2018 18:28, Oliver Freyermuth wrote:
> We have yet to observe these hangs; we've been running this setup with
> ~5 VMs and ~10 disks for about half a year now, with daily snapshots.
> But all of these VMs have very "low" I/O, since we put anything
> I/O-intensive on bare metal (with automated provisioning, of course).
>
> So I'll chime in on your question, especially since there may be VMs on
> our cluster in the future whose guest OS is not running an agent.
> Since we have not observed this yet, I'll also ask: what's your scale?
> Is it hundreds of VMs/disks? Hourly snapshots? I/O-intensive VMs?

5 hosts, 15 VMs, daily snapshots. I/O is variable (customer workloads); it's usually not that high, but it can easily peak at 100% utilization when certain things happen. We don't have great I/O performance (RBD over 1 Gbps links to HDD OSDs).

I'm poring through monitoring graphs now, and I think the issue this time around was simply too much dirty data in a guest's page cache. The VM that failed spent three minutes flushing writes to disk before its I/O was quiesced, at around 100 IOPS (the actual data throughput was low, though, so these were small writes). That exceeded our timeout, and things went south from there.
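One guest-side mitigation would be to cap how much dirty data can accumulate, so the flush that fsfreeze triggers stays short. A minimal sketch for a Linux guest follows; vm.dirty_bytes and vm.dirty_background_bytes are the real knobs, but the byte values below are illustrative, not recommendations:

    #!/usr/bin/env python3
    # Sketch: cap the guest's dirty page cache so the sync triggered by
    # fsfreeze has a bounded amount of data to flush. Values are
    # placeholders -- tune for your workload. Needs root in the guest.

    DIRTY_CAPS = {
        # Hard cap on dirty data before writers must flush (~256 MiB).
        "/proc/sys/vm/dirty_bytes": 256 * 1024 * 1024,
        # Background writeback kicks in at this much dirty data (~64 MiB).
        "/proc/sys/vm/dirty_background_bytes": 64 * 1024 * 1024,
    }

    for path, value in DIRTY_CAPS.items():
        with open(path, "w") as f:  # equivalent to `sysctl -w`
            f.write(str(value))

Setting the *_bytes variants disables the corresponding *_ratio knobs, which scale with RAM and can let many GiB of dirty data pile up on large guests.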

I wasn't sure whether fsfreeze does a full sync to disk, but given the I/O behavior I'm seeing, that seems to be the case. Unfortunately, coming up with an upper bound for the freeze time now seems tricky. I'm increasing our timeout to 15 minutes; we'll see if the problem recurs.
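For illustration, here's a rough sketch of what a bounded freeze/snapshot/thaw cycle could look like with the libvirt Python bindings, assuming a working guest agent; the domain name, image name, and timeout are placeholders:

    #!/usr/bin/env python3
    # Sketch: freeze with a bounded wait, snapshot only if quiesced, and
    # always attempt a thaw. Assumes libvirt-python and a guest agent.
    import subprocess
    import threading
    import libvirt

    FREEZE_TIMEOUT = 15 * 60  # seconds; generous, per the discussion above

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("my-vm")  # placeholder domain name

    outcome = {}

    def do_freeze():
        # fsFreeze() blocks while the guest agent syncs dirty data to disk.
        try:
            outcome["frozen"] = dom.fsFreeze()
        except libvirt.libvirtError as err:
            outcome["error"] = err

    worker = threading.Thread(target=do_freeze, daemon=True)
    worker.start()
    worker.join(FREEZE_TIMEOUT)

    if worker.is_alive() or "error" in outcome:
        # Freeze timed out or failed: skip the snapshot. The agent may
        # still finish the freeze later, so thaw defensively; note this
        # call can itself block or fail while the freeze job is active.
        try:
            dom.fsThaw()
        except libvirt.libvirtError:
            pass
        raise SystemExit("fsfreeze did not finish in time; snapshot skipped")

    try:
        # Quiesced: take the snapshot (placeholder pool/image@snap name).
        subprocess.run(["rbd", "snap", "create", "rbd/my-vm-disk@daily"],
                       check=True)
    finally:
        dom.fsThaw()

The daemon thread is what lets the script give up on a stuck freeze without hanging at exit.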

Given this, it makes even more sense to just avoid the freeze where at all reasonable. As far as I can tell, there's no real way to guarantee that an fsfreeze will complete in a "reasonable" amount of time.

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc