[
https://issues.apache.org/jira/browse/CLOUDSTACK-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brenn Oosterbaan updated CLOUDSTACK-7319:
-----------------------------------------
Fix Version/s: 4.4.0
> Copy Snapshot command too heavy on XenServer Dom0 resources when using dd to
> copy incremental snapshots
> -------------------------------------------------------------------------------------------------------
>
> Key: CLOUDSTACK-7319
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7319
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: Snapshot, XenServer
> Affects Versions: 4.0.0, 4.0.1, 4.0.2, 4.1.0, 4.1.1, 4.2.0, Future, 4.2.1,
> 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1
> Reporter: Joris van Lieshout
> Assignee: Brenn Oosterbaan
> Priority: Critical
> Fix For: 4.4.0
>
>
> We noticed that the dd process was way too aggressive on Dom0, causing all
> kinds of problems on a XenServer with medium workloads.
> ACS uses the dd command to copy incremental snapshots to secondary storage.
> This process is too heavy on Dom0 resources, impacts DomU performance, and
> can even lead to domain freezes (including Dom0) of more than a minute.
> We've found that this is because the Dom0 kernel caches the read and write
> operations of dd.
> Some of the issues we have seen as a consequence of this are:
> - DomU performance degradation/freezes
> - OVS freezing and not forwarding any traffic
> - This includes LACPDUs, resulting in the bond going down
> - keepalived heartbeat packets between RRVMs not being sent/received,
> resulting in a flapping RRVM master state
> - Breaking snapshot copy processes
> - The XenServer heartbeat script reaching its timeout and fencing the server
> - Pool master connection loss
> - ACS marking the host as down and fencing the instances even though they are
> still running on the original host, resulting in the same instance running on
> two hosts in one cluster
> - VHD corruption as a result of some of the issues mentioned above
> We've developed a patch for the XenServer plugin script
> /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both the input
> and the output file (iflag=direct oflag=direct), as sketched below.
> Our tests have shown that Dom0 load during snapshot copies is much lower.
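> For illustration only (not the actual patch), a minimal sketch of what the dd
> invocation in vmopsSnapshot could look like with the direct flags added; the
> helper name and parameters below are made up for the example:
>
>     import subprocess
>
>     def copy_vhd_direct(src, dst, block_size="2M"):
>         # Hypothetical helper: copy a snapshot VHD with dd while bypassing
>         # the Dom0 page cache on both the read and the write side.
>         cmd = ["dd",
>                "if=%s" % src,
>                "of=%s" % dst,
>                "bs=%s" % block_size,
>                "iflag=direct",   # O_DIRECT reads of the source VHD
>                "oflag=direct"]   # O_DIRECT writes to secondary storage
>         rc = subprocess.call(cmd)
>         if rc != 0:
>             raise Exception("dd failed with exit code %d" % rc)
>         return rc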
--
This message was sent by Atlassian JIRA
(v6.2#6252)