I gather array-based replication works quite differently, and I’m guessing I
wouldn’t be seeing these kinds of issues if we were using it.  Sometime
soon the Linux admins plan to use VSR on some of their VMs, so it
will be interesting to see if they have any problems.  I should test this
without any quiescing (and therefore no VSS involvement) just to see if that
makes a difference.



Here are some examples of Event Log errors that were happening only while
replication was enabled.  Every single VM had at least some of the
following:



System Event ID 129 – LSI_SAS – Reset to device, \Device\RaidPort0, was
issued

System Event ID 7011 – Service Control Manager – A timeout (30000
milliseconds) was reached while waiting for a transaction response from the
VMTools service.

System Event ID 8 – volsnap – The flush and hold writes operation on volume
<X> timed out while waiting for a release writes command.  (Got this on all
volumes.)

System Event ID 5012 – WAS – A process serving application pool
‘DefaultAppPool’ exceeded time limits during start up.  The process id was
‘xxx’. (Many different process IDs.)

System Event ID 57 – Ntfs – The system failed to flush data to the
transaction log……..

System Event ID 137 – Ntfs – The default transaction resource manager on
volume <XXXXX> encountered a non-retryable error…….



Application Event ID 18056 – MSSQL$<INSTANCE> – The client was unable to
reuse a session…. The failure ID is 29……

Application Event ID 1000 – VMware Tools – [vmsvc:vmbackup] Failed to send
event to the VMX: Unknown command.

Application Event ID 1000 – VMware Tools – [vmvss:vmvss]
CVmSnapshotRequestor:UnregisteredProvider…………….

Application Event ID 12298 – VSS – Volume Shadow Copy Service error: The
I/O writes cannot be held………..The volume index in the shadow copy set is 0.
………….

Application Event ID 12293 – VSS – Error calling a routine on Shadow Copy
Provider……

Application Event ID 12340 – VSS – VSS waited more than 40 seconds for all
volumes to be flushed……….

Application Event ID 24583 – SQLWRITER – …..Native Error: 3013…….

Application Event ID 1 – SQLVDI – Loc=TriggerAbort……..

Application Event ID 12289 – VSS – Unexpected error DeviceIoControl(
\\?\fdc#generic_floppy_drive#
...................).
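
To check the "only while replication was enabled" pattern across all seven VMs, something like the following could tally these event IDs inside and outside the replication window. This is only a rough sketch: the `tally` helper, the sample rows, and the window dates are made up for illustration, and real input would have to come from an Event Log export of some kind.

```python
from collections import Counter
from datetime import datetime

# (log, event ID) pairs taken from the list above.
WATCHED = {
    ("System", 129), ("System", 7011), ("System", 8), ("System", 5012),
    ("System", 57), ("System", 137),
    ("Application", 18056), ("Application", 1000), ("Application", 12298),
    ("Application", 12293), ("Application", 12340), ("Application", 24583),
    ("Application", 1), ("Application", 12289),
}


def tally(events, start, end):
    """Count watched events inside and outside the [start, end) window."""
    inside, outside = Counter(), Counter()
    for when, log, event_id in events:
        if (log, event_id) not in WATCHED:
            continue  # ignore events that are not on the list
        bucket = inside if start <= when < end else outside
        bucket[(log, event_id)] += 1
    return inside, outside


# Hypothetical sample rows; real rows would come from an exported log.
sample = [
    (datetime(2014, 7, 10, 2, 0), "System", 129),
    (datetime(2014, 7, 10, 2, 1), "Application", 12298),
    (datetime(2014, 7, 11, 2, 0), "System", 129),
    (datetime(2014, 9, 20, 2, 0), "System", 6005),  # not watched, skipped
]

# Hypothetical replication window (assumed dates, not from our logs).
inside, outside = tally(sample, datetime(2014, 7, 1), datetime(2014, 9, 1))
print("during replication:", dict(inside))
print("outside replication:", dict(outside))
```

If the counts outside the window come back empty for every VM, that backs up what I saw by eye in the logs.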



Also, looking back 3 months in our server monitoring system, all of these
VMs had occasional ping drops for as long as replication was enabled, but
not before or after.  This is consistent with the outages users experienced
on the heavily used servers.



*From:* listsad...@lists.myitforum.com [mailto:
listsad...@lists.myitforum.com] *On Behalf Of *Sean Martin
*Sent:* Wednesday, September 24, 2014 7:49 PM
*To:* ntsysadm@lists.myitforum.com
*Subject:* Re: [NTSysADM] vSphere Replication Anyone?



I'm afraid I won't be of much help as I don't have any experience with
vSphere Replication. However, I have quite a bit of experience with SRM
using array-based replication and I have never encountered an issue like
that. I am curious about the link between vSphere Replication and the event
log errors you identified. Please keep us updated if you identify the root
cause for these issues.



- Sean



On Wed, Sep 24, 2014 at 11:55 AM, Charles F Sullivan <
charles.sulliva...@bc.edu> wrote:

Has anyone used vSphere Replication (or Site Recovery Manager) to replicate
Windows VMs?  Over the summer we started to use this as a DR solution.
There were seven VMs replicated over fast links (1 Gbps, I believe) to a
location less than a mile away.  The replication for the most part was
working.  We were even able to bring down 4 source servers, bring up the
targets and test the services provided by those servers successfully.



The problem is that the SQL servers in particular had lots of different
Event Log errors related to VSS, NTFS, SQL, as well as vCenter Log errors.
It got to the point that the SQL servers were completely unresponsive on
multiple occasions.  Now that I’ve had time to look through more logs,
every single one of the seven Windows servers that we replicated had these
types of errors and had at least some network disruptions due to the
resource exhaustion.  Before and after replication, none of these problems
existed.  They are a mix of Windows 2003 and 2008 R2 VMs.  They are even
running on a couple of very disparate hosts (IBM Blade Centers and Cisco
UCS), but have the same issues regardless of that.



I have a case open with VMware, though so far I’m not having much luck.
I’m not really asking for help with this, but because this is affecting
100% of the VMs we’ve tried, I wanted to find out if anyone else has used
this solution without having these issues.  Overall, our VMware environment
is pretty healthy and we run several hundred VMs with little downtime.
