I gather that array-based replication works quite differently, and I suspect we wouldn’t be seeing these kinds of issues if we were using it. The Linux admins plan to start using VSR on some of their VMs soon, so it will be interesting to see whether they have any problems. I should test this without any quiescing (and therefore with no VSS involved) just to see if that makes a difference.
Here are some examples of Event Log errors that were happening only while replication was enabled. Every single VM had at least some of the following:
System Event ID 129 – LSI_SAS – Reset to device, \Device\RaidPort0, was issued
System Event ID 7011 – Service Control Manager – A timeout (30000 milliseconds) was reached while waiting for a transaction response from the VMTools service.
System Event ID 8 – volsnap – The flush and hold writes operation on volume <X> timed out while waiting for a release writes command. (Got this on all volumes.)
System Event ID 5012 – WAS – A process serving application pool ‘DefaultAppPool’ exceeded time limits during start up. The process id was ‘xxx’. (Many different process IDs.)
System Event ID 57 – Ntfs – The system failed to flush data to the transaction log……..
System Event ID 137 – Ntfs – The default transaction resource manager on volume <XXXXX> encountered a non-retryable error…….
Application Event ID 18056 – MSSQL$<INSTANCE> – The client was unable to reuse a session…. The failure ID is 29……
Application Event ID 1000 – VMware Tools – [vmsvc:vmbackup] Failed to send event to the VMX: Unknown command.
Application Event ID 1000 – VMware Tools – [vmvss:vmvss] CVmSnapshotRequestor:UnregisteredProvider…………….
Application Event ID 12298 – VSS – Volume Shadow Copy Service error: The I/O writes cannot be held………..The volume index in the shadow copy set is 0. ………….
Application Event ID 12293 – VSS – Error calling a routine on Shadow Copy Provider……
Application Event ID 12340 – VSS – VSS waited more than 40 seconds for all volumes to be flushed……….
Application Event ID 24583 – SQLWRITER – …..Native Error: 3013…….
Application Event ID 1 – SQLVDI – Loc=TriggerAbort……..
Application Event ID 12289 – VSS – Unexpected error DeviceIoControl( \\?\fdc#generic_floppy_drive# ...................).
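One way to double-check the "only while replication was enabled" observation is to tally the listed event IDs inside and outside the replication window. The following is only a rough Python sketch: the `(log_name, event_id, timestamp)` record shape, the `tally` helper, and the sample dates are all assumptions for illustration, not anything taken from the actual logs.

```python
from collections import Counter
from datetime import datetime

# (log, event ID) pairs copied from the list above.
WATCHED = {
    ("System", 129), ("System", 7011), ("System", 8), ("System", 5012),
    ("System", 57), ("System", 137),
    ("Application", 18056), ("Application", 1000), ("Application", 12298),
    ("Application", 12293), ("Application", 12340), ("Application", 24583),
    ("Application", 1), ("Application", 12289),
}


def tally(events, start, end):
    """Count watched events inside and outside the window [start, end).

    `events` is an iterable of (log_name, event_id, timestamp) tuples.
    That shape is an assumption about how the logs were exported.
    """
    inside, outside = Counter(), Counter()
    for log_name, event_id, ts in events:
        key = (log_name, event_id)
        if key not in WATCHED:
            continue  # ignore events that are not on the list
        if start <= ts < end:
            inside[key] += 1
        else:
            outside[key] += 1
    return inside, outside


if __name__ == "__main__":
    # Hypothetical replication window and sample records, for illustration only.
    start, end = datetime(2014, 7, 1), datetime(2014, 9, 1)
    sample = [
        ("System", 129, datetime(2014, 7, 15, 2, 0)),
        ("Application", 12340, datetime(2014, 8, 1, 3, 0)),
        ("System", 129, datetime(2014, 6, 1, 2, 0)),
    ]
    inside, outside = tally(sample, start, end)
    print("during replication:", dict(inside))
    print("outside replication:", dict(outside))
```

If the outside counts come back empty (or nearly so) for every VM, that supports the link to replication; a VM with the same errors outside the window would point somewhere else.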
Also, looking back 3 months in our server monitoring system, all of these VMs had occasional ping drops while replication was running, but not before or after. This is in line with the fact that users experienced outages on the heavily used servers.

*From:* listsad...@lists.myitforum.com [mailto: listsad...@lists.myitforum.com] *On Behalf Of *Sean Martin
*Sent:* Wednesday, September 24, 2014 7:49 PM
*To:* ntsysadm@lists.myitforum.com
*Subject:* Re: [NTSysADM] vSphere Replication Anyone?

I'm afraid I won't be of much help as I don't have any experience with vSphere replication. However, I have quite a bit of experience with SRM using array-based replication and I have never encountered an issue like that. I am curious about the link between vSphere replication and the event log errors you identified. Please keep us updated if you identify the root cause for these issues.

- Sean

On Wed, Sep 24, 2014 at 11:55 AM, Charles F Sullivan <charles.sulliva...@bc.edu> wrote:

Has anyone used vSphere Replication (or Site Recovery Manager) to replicate Windows VMs? Over the summer we started to use this as a DR solution. There were seven VMs replicated over fast links (1 GB, I believe) to a location less than a mile away. The replication for the most part was working. We were even able to bring down 4 source servers, bring up the targets and test the services provided by those servers successfully. The problem is that the SQL servers in particular had lots of different Event Log errors related to VSS, NTFS, and SQL, as well as vCenter log errors. It got to the point where the SQL servers were completely unresponsive on multiple occasions. Now that I’ve had time to look through more logs, every single one of the seven Windows servers that we replicated had these types of errors and had at least some network disruptions due to the resource exhaustion. Before and after replication, none of these problems existed. They are a mix of Windows 2003 and 2008 R2 VMs.
They are even running on two very different host platforms (IBM BladeCenter and Cisco UCS), yet show the same issues on both. I have a case open with VMware, though so far I’m not having much luck. I’m not really asking for help with this, but because this is affecting 100% of the VMs we’ve tried, I wanted to find out if anyone else has used this solution without having these issues. Overall, our VMware environment is pretty healthy and we run several hundred VMs with little downtime.